Preface



Sometimes milestones in the evolution of the DAGM Symposium become
immediately visible. The Technical Committee decided to publish the symposium
proceedings completely in English. As a consequence, we successfully negotiated with
Springer-Verlag to publish in the internationally well-accepted series “Lecture Notes in
Computer Science”. The quality of the contributions convinced the editors and the
lectors. Thanks to them and to the authors. We received 105 acceptable, good, and
even excellent manuscripts. Using three reviewers for each anonymized paper, we
carefully selected 58 talks and posters. Our 41 reviewers had a hard job evaluating
and especially rejecting contributions. We are grateful for the time and effort they
spent on this task. The program committee awarded prizes to the best papers. We are
much obliged to the generous sponsors.

We had three invited talks from outstanding colleagues, namely Bernhard Nebel 
(Robot Soccer - A Challenge for Cooperative Action and Perception), Thomas 
Lengauer (Computational Biology - An Interdisciplinary Challenge for 
Computational Pattern Recognition), and Nassir Navab (Medical and Industrial 
Augmented Reality: Challenges for Real-Time Vision, Computer Graphics, and 
Mobile Computing). N. Navab even wrote a special paper for this conference, which 
is included in the proceedings. 

We were proud that we could convince well-known experts to offer tutorials to our
participants: H.-P. Seidel, Univ. Saarbrücken - A Framework for the Acquisition,
Processing, and Interactive Display of High Quality 3D Models; S. Heuel, Univ.
Bonn - Projective Geometry for Grouping and Orientation Tasks; G. Rigoll, Univ.
Duisburg - Hidden Markov Models for Pattern Recognition and
Man-Machine-Communication; G. Klinker, TU München - Foundations and Applications of
Augmented Reality; R. Koch, Univ. Kiel - Reconstruction from Image Sequences; W.
Förstner, Univ. Bonn - Uncertainty, Testing, and Estimation of Geometric
Parameters; K.-H. Englmeier, GSF Munich - Computer Vision and Virtual Reality
Applications in Medicine; W. Eckstein and C. Steger, MVTec GmbH, Munich -
Industrial Computer Vision.

I appreciate the support of the many persons and institutions who made this conference
happen: the generosity and cooperation of the Fachhochschule München and the
Technische Universität München, the financial and material contributions by our
sponsors, and the helpers who did so much excellent work in preparing and realizing this
event. Stefan Florczyk, my co-editor, and his team - Ulrike Schroeter, Oliver Bosl -
did a great job. With the experience and imagination of my colleagues in the local
committee - W. Abmayr, H. Ebner, K.-H. Englmeier, G. Hirzinger, E. Hundt, G.
Klinker, H. Mayer - it was a pleasure to create and design this symposium. We all did
our best and hope that the participants benefited personally from the conference
and enjoyed their stay in Munich.



July 2001 



Bernd Radig 
Organization



Since 1978 the DAGM (German Association for Pattern Recognition) has staged an
annual scientific symposium at different venues with the aim of presenting conceptual
formulations, ways of thinking, and research results from different areas in pattern
recognition, of stimulating the exchange of experience and ideas between experts,
and of fostering the young generation.

The DAGM e.V. was founded as a registered society in September 1999. Until then 
the DAGM was constituted from supporter societies which have since then been 
honorary members of the DAGM e.V.: 

DGaO Deutsche Arbeitsgemeinschaft für angewandte Optik (German Society of Applied Optics)

GMDS Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (German Society for Medical Informatics, Biometry, and Epidemiology)

GI Gesellschaft für Informatik (German Informatics Society)

ITG Informationstechnische Gesellschaft (Information Technology Society)

DGN Deutsche Gesellschaft für Nuklearmedizin (German Society of Nuclear Medicine)

IEEE Deutsche Sektion des IEEE (The Institute of Electrical and Electronic Engineers, German Section)

DGPF Deutsche Gesellschaft für Photogrammetrie und Fernerkundung

VDMA Fachabteilung industrielle Bildverarbeitung/Machine Vision im VDMA (Robotics + Automation Division within VDMA)

GNNS German Chapter of European Neural Network Society

DGR Deutsche Gesellschaft für Robotik

Model-Based Image Segmentation Using Local
Self-Adapting Separation Criteria



Robert Hanek 

Lehrstuhl für Bildverstehen und Wissensbasierte Systeme,
Forschungsgruppe Bildverstehen (FG BV),
Technische Universität München, Germany
http://www9.in.tum.de/people/hanek/



Abstract. In this paper we address the problem of model-based image
segmentation by fitting deformable models to the image data. From uncertain
a priori knowledge of the model parameters an initial probability
distribution of the model edge in the image is obtained. From the vicinity
of the surmised edge local statistics are learned for both sides of the edge.
These local statistics provide locally adapted criteria to distinguish the
two sides of the edge even in the presence of spatially changing properties
such as texture, shading, or color. Based on the local statistics the model
parameters are iteratively refined using a MAP estimation. Experiments
with RGB images show that the method is capable of achieving high
subpixel accuracy even in the presence of texture, shading, clutter, and
partial occlusion.



1 Introduction 

Deformable models, also known as snakes or active contours [9], have proven
to be an efficient way to incorporate application-specific a priori knowledge into
computer vision algorithms. For example, in order to segment a bone in a medical
image or in order to visually track a person, models describing the possible 
contours of the objects of interest are used [11,10,2]. The parameters of the 
models specify object properties such as the pose, size, and shape. The problem 
of estimating parameters of curve models from images not only has applications 
in low-level vision such as image segmentation and tracking but also in high-level 
vision such as 3-D pose estimation, camera calibration, 3-D reconstruction, and 
object recognition. 

In this paper we propose a novel method for estimating the parameters of 
deformable edge models from image data. This method can also be applied to 
high-level problems such as 3-D reconstruction and pose estimation. However, 
here we focus on model-based image segmentation. Due to the large number
of publications on image segmentation, only a few aspects of the relevant
work can be covered in the following. See our companion paper [7] for a more
comprehensive version of this publication.



B. Radig and S. Florczyk (Eds.): DAGM 2001, LNCS 2191, pp. 1-8, 2001.
© Springer-Verlag Berlin Heidelberg 2001
Fig. 1. Mug in front of an inhomogeneous background: a.) color image (see CD),
b.) color edges detected by a gradient-based approach (C. Steger [14]). The mug and
the background are not well separated due to the edges within both regions.



1.1 Previous Work 

The body of work on image segmentation can be roughly classified into three
categories: (i) edge-based segmentation, (ii) region-based segmentation, and
(iii) methods integrating edge-based and region-based segmentation.

(i) Edge-based segmentation relies on discontinuities of image data [1,
12,3]. The problem of edge-based segmentation is that in practice the
edge profile is usually not known. Furthermore, the profile often varies heavily
along the edge due to, e.g., shading and texture. Because of these difficulties,
usually a simple step edge is assumed and the edge detection is performed based
on a maximum image gradient. In Fig. 1a the color values on either side of the
mug’s contour are not constant even within a small vicinity. Hence, methods
maximizing the image gradient have difficulty separating the mug and the
background, see Fig. 1b.

(ii) Region-based segmentation methods such as [16, 5] rely on the
homogeneity of spatially localized features (e.g. RGB values). The underlying
homogeneity assumption is that the features of all pixels within one region are
statistically independently distributed according to the same probability density
function. Often this assumption does not hold. In Fig. 1a the distributions of
the RGB values of the mug and the background depend on the locations within
the image.

(iii) Integrating methods: especially in recent years methods have been
published which aim to overcome the individual shortcomings of edge-based and
region-based segmentation by integrating both segmentation principles [15, 13, 4,
8]. These methods seek a compromise between an edge-based criterion, e.g. the
magnitude of the image gradient, and a region-based criterion evaluating the
homogeneity of the regions. However, it is questionable whether a compromise
between the two criteria yields reasonable results when both the homogeneity
assumption and the assumption regarding the edge profile do not hold, as in
Fig. 1a.



1.2 Main Contributions 

The main contributions of this paper are as follows: 1.) Local self-adapting
separation criteria are used in order to distinguish adjacent regions: While
other methods use certain fixed criteria (e.g. image gradients, homogeneity
criteria, or combinations) to separate adjacent regions, we use local self-adapting
separation criteria. These criteria are based on local statistics of pixel features
obtained from the vicinity of the surmised curve. They allow the two sides of an
edge to be separated even in the presence of spatially changing properties,
such as changing texture, color, or shading. For the computation of the desired
local statistics an efficient method is proposed.

2.) A fit between image data and a ‘blurred model’ is proposed:
in order to increase the capture range, gradient-based methods typically fit the
model to a blurred image. We take the opposite approach. We use non-blurred
image data and a ‘blurred model’. Instead of optimizing the relation
between blurred image data and a single vector of model parameters we optimize
the relation between the non-blurred image data and a probability distribution
of model parameters. The advantages are as follows: (i) the capture range is
enlarged according to the local uncertainties of the model curve, which
significantly improves the convergence; (ii) optimizing the fit between an image and a
‘blurred model’ is in general computationally cheaper than blurring the image;
(iii) high-frequency information of the image data can be used.

Overview of the Paper: The remainder of this paper is organized as
follows: in section 2 an overview of the Contracting Curve Density
(CCD) algorithm proposed here is given. Section 3 describes the two main steps
of the CCD algorithm. Section 4 contains an experimental evaluation and finally
in section 5 a conclusion is given.

2 Overview of the Contracting Curve Density (CCD) 
Algorithm 

The CCD algorithm proposed here estimates the parameters of curve models
from image data. The CCD algorithm can roughly be characterized as an
extension of the EM algorithm [6] using additional knowledge. The additional
knowledge consists of: (i) a curve model, which describes the set of possible boundaries
between adjacent regions, and (ii) a model of the imaging device. The CCD
algorithm, depicted in Fig. 2, performs an iteration of two steps, which roughly
correspond to the two steps of the EM algorithm: 1. Local statistics of image
data are learned from the vicinity of the curve. These statistics locally
characterize the two sides of the edge curve. 2. From these statistics, the estimation
of the model parameters is refined by optimizing the separation of the two
sides. This refinement in turn leads in the next iteration step to an improved
statistical characterization of the two sides. During the process, the uncertainty
of the model parameters decreases and the probability density of the curve in
the image contracts to a single edge estimate. We therefore call the algorithm
the Contracting Curve Density (CCD) algorithm.

Input: The input of the CCD algorithm consists of the image data $I^*$ and the
curve model. The image data are local features, e.g. RGB values, given for each
pixel of the image. The curve model consists of two parts: 1.) a differentiable
curve function $c$ describing the model edge curve in the image as a function of the
Input: image data $I^*$, differentiable curve function $c$, mean $m_\Phi^*$ and covariance $\Sigma_\Phi^*$

Output: estimate $m_\Phi$ of model parameters and associated covariance $\Sigma_\Phi$

Initialization: mean $m_\Phi = m_\Phi^*$, covariance $\Sigma_\Phi = c_1 \cdot \Sigma_\Phi^*$
repeat

1. learn local statistics of image data from the vicinity of the curve

(a) compute pixels $v$ in vicinity $\mathcal{V}$ of the image curve from $c$, $m_\Phi$ and $\Sigma_\Phi$;
    $\forall v \in \mathcal{V}$ compute vague assignment $a_v(m_\Phi, \Sigma_\Phi)$ to the sides of the curve

(b) $\forall v \in \mathcal{V}$ compute local statistics $S_v$ of image data $I_{\mathcal{V}}^*$

2. refine estimation of model parameters

(a) update mean $m_\Phi$ by performing one iteration step of MAP estimation:

    $m_\Phi = \arg\min_{m_\Phi} \chi^2(m_\Phi)$ with

    $\chi^2(m_\Phi) = -2 \ln [\, p(I_{\mathcal{V}} = I_{\mathcal{V}}^* \mid a_{\mathcal{V}}(m_\Phi, \Sigma_\Phi), S_{\mathcal{V}}) \; p(m_\Phi \mid m_\Phi^*, \Sigma_\Phi^*) \,]$

(b) update covariance $\Sigma_\Phi$ from Hessian of $\chi^2(m_\Phi)$
until changes of $m_\Phi$ and $\Sigma_\Phi$ are small enough

Post-processing: estimate covariance $\Sigma_\Phi$ from Hessian of $\chi^2(m_\Phi)$
return mean $m_\Phi$ and covariance $\Sigma_\Phi$

Fig. 2. The CCD algorithm iteratively refines a Gaussian a priori density $p(\Phi) =
p(\Phi \mid m_\Phi^*, \Sigma_\Phi^*)$ of model parameters to a Gaussian approximation
$p(\Phi \mid m_\Phi, \Sigma_\Phi)$ of the posterior density $p(\Phi \mid I^*)$.

model parameters $\Phi$, and 2.) a Gaussian a priori distribution $p(\Phi) = p(\Phi \mid m_\Phi^*, \Sigma_\Phi^*)$
of the model parameters defined by the mean $m_\Phi^*$ and the covariance $\Sigma_\Phi^*$.
(The superscript $*$ indicates input data.) Depending on the application, the
quantities $m_\Phi^*$ and $\Sigma_\Phi^*$ may be obtained, for example, from a training set, by a human
initialization, or from a prediction over time.

Output: The output of the algorithm consists of the estimate $m_\Phi$ of the
model parameters $\Phi$ and the covariance $\Sigma_\Phi$ describing the uncertainty of the
estimate. The estimate $m_\Phi$ and the covariance $\Sigma_\Phi$ define a Gaussian
approximation $p(\Phi \mid m_\Phi, \Sigma_\Phi)$ of the posterior density $p(\Phi \mid I^*)$.

Initialization: The estimate $m_\Phi$ of the model parameters and the associated
covariance $\Sigma_\Phi$ are initialized using the mean $m_\Phi^*$ and the scaled covariance
$c_1 \cdot \Sigma_\Phi^*$ of the a priori distribution. The factor $c_1$ (e.g. $c_1 = 9$) increases
the initial uncertainty and thereby enlarges the capture range of the CCD algorithm.
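
As a small worked illustration of this initialization (a sketch only; the per-parameter standard deviations $\sigma_i$ are notation introduced here for illustration), scaling the prior covariance by $c_1$ scales every standard deviation by $\sqrt{c_1}$:

```latex
\Sigma_\Phi = c_1\,\Sigma_\Phi^{*}
\quad\Longrightarrow\quad
\sigma_i = \sqrt{c_1}\,\sigma_i^{*},
\qquad
c_1 = 9 \;\Rightarrow\; \sigma_i = 3\,\sigma_i^{*}.
```

With $c_1 = 9$ the initial uncertainty band around the curve is therefore three times as wide as the one implied by the prior alone.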

3 Steps of the CCD Algorithm 

The two basic steps of the CCD algorithm, depicted in Fig. 2, are briefly sum- 
marized in this section. Due to the space limitation, only the basic concept can 
be presented here. A more detailed and more mathematical description is given 
in our companion paper [7].



3.1 Learn Local Statistics (Step 1)

The Gaussian distribution of model parameters $p(\Phi \mid m_\Phi, \Sigma_\Phi)$ and the model
curve function $c$ define a probability distribution of the edge curve in the image.
Fig. 3. Contour plot of the windows (weights) used to estimate (learn) local
statistics: the three roughly parallel lines describe the surmised position and uncertainty
($\sigma$-interval) of the curve. For the pixels on the perpendicular line (straight) local
statistics $S_v$ are computed from the two depicted windows. The windows are adapted in size
and shape to the surmised curve and its uncertainty.



This curve distribution vaguely assigns each pixel in the vicinity of the surmised
curve to one side of the curve. In step 1a the set $\mathcal{V}$ of pixels $v$ in the vicinity
of the surmised curve is determined and for the pixels $v \in \mathcal{V}$ the vague side
assignments $a_v(m_\Phi, \Sigma_\Phi)$ are computed. The components of the assignments $a_v$
specify to which extent pixel $v$ is expected to belong to the corresponding side.
Fig. 4 row b.) depicts for pixels $v \in \mathcal{V}$ the assignments to the lower side of the
surmised edge. White pixels indicate a quite certain assignment to the lower
side.
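
To illustrate how such vague assignments might be computed, the following minimal sketch assumes that the displacement of the curve along its normal is Gaussian distributed, so that the assignment of a pixel to one side is the Gaussian cumulative distribution evaluated at the pixel's signed distance from the mean curve. This particular form and the names `side_assignment`, `vague_assignments`, `signed_dist`, and `sigma` are illustrative assumptions; the exact definition is given in the companion paper [7].

```python
import math


def side_assignment(signed_dist, sigma):
    """Degree to which a pixel belongs to the positive side of the curve.

    signed_dist: signed distance of the pixel from the mean curve (pixels).
    sigma: standard deviation of the curve position along its normal (pixels).
    Assumption: the curve displacement along the normal is Gaussian.
    """
    if sigma <= 0.0:
        # No uncertainty left: the assignment becomes a hard decision.
        if signed_dist == 0:
            return 0.5
        return 1.0 if signed_dist > 0 else 0.0
    return 0.5 * (1.0 + math.erf(signed_dist / (sigma * math.sqrt(2.0))))


def vague_assignments(signed_dists, sigma):
    """Return (a_positive, a_negative) per pixel; both components sum to one."""
    result = []
    for d in signed_dists:
        a_pos = side_assignment(d, sigma)
        result.append((a_pos, 1.0 - a_pos))
    return result
```

In this sketch a large `sigma` keeps the assignments of pixels near the curve close to 0.5, while a shrinking `sigma` drives them towards 0 or 1, which mirrors the behavior shown in Fig. 4 row b.).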

In step 1b local statistics $S_v$, i.e. first and second order moments, of the
image feature vectors $I_v^*$ are learned from pixels which are assigned to one side
with high certainty. This is done for each of the two sides separated by the curve.
In order to obtain the statistics, locally adapted windows (weights) are used, see
Fig. 3. The windows are chosen such that the local statistics $S_v$ can be computed
recursively. The resulting time complexity of computing $S_v$ for all pixels $v \in \mathcal{V}$
is $O(|\mathcal{V}|)$, where $|\mathcal{V}|$ is the number of pixels in the vicinity $\mathcal{V}$. Note that the time
complexity is independent of the window size along the curve.
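
The following sketch shows one way in which windowed moments can be accumulated recursively so that the cost does not depend on the window length. It assumes exponentially decaying windows along a one-dimensional sequence of scalar samples taken along the curve; the window shape, the decay factor `gamma`, and the function names are assumptions made for this illustration and are not taken from the paper.

```python
def two_sided_exp_sum(values, gamma):
    """Return s[i] = sum_j gamma**|i - j| * values[j] in O(n) time.

    One forward and one backward recursion replace the explicit window sum.
    """
    n = len(values)
    fwd = [0.0] * n
    bwd = [0.0] * n
    acc = 0.0
    for i in range(n):
        acc = values[i] + gamma * acc
        fwd[i] = acc
    acc = 0.0
    for i in range(n - 1, -1, -1):
        acc = values[i] + gamma * acc
        bwd[i] = acc
    # values[i] is contained in both passes, so it is subtracted once.
    return [fwd[i] + bwd[i] - values[i] for i in range(n)]


def local_statistics(features, weights, gamma=0.9):
    """Local weighted mean and variance at every position along the curve.

    features: scalar feature per sample (e.g. one color channel).
    weights: positive side-assignment weight per sample.
    """
    s0 = two_sided_exp_sum(weights, gamma)
    s1 = two_sided_exp_sum([w * x for w, x in zip(weights, features)], gamma)
    s2 = two_sided_exp_sum([w * x * x for w, x in zip(weights, features)], gamma)
    stats = []
    for a, b, c in zip(s0, s1, s2):
        mean = b / a
        var = max(c / a - mean * mean, 0.0)  # clamp tiny negative round-off
        stats.append((mean, var))
    return stats
```

Changing `gamma` alters the effective window length but not the number of operations, which is the property the paragraph above refers to.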



3.2 Refine the Estimation of Model Parameters (Step 2) 

In the second step, the estimation of the model parameters is refined based on
a MAP optimization. Step 2a updates the estimate $m_\Phi$ such that the vague
assignments $a_v(m_\Phi, \Sigma_\Phi)$ of the pixels $v \in \mathcal{V}$ fit best to the local statistics $S_v$. The
feature vectors $I_v^*$ of pixels $v \in \mathcal{V}$ are modeled as Gaussian random variables. The
mean vectors and covariances are estimated from the local statistics $S_v$ obtained
from the corresponding side of pixel $v$. The feature vectors of edge pixels are
modeled as weighted linear combinations of both sides of the edge. In step 2a,
only one iteration step of the resulting MAP optimization is performed. Since the
vague assignments $a_v(m_\Phi, \Sigma_\Phi)$ explicitly take the uncertainty (the covariance
$\Sigma_\Phi$) of the estimate into account, the capture range is enlarged according to
iteration: error in pixels (left margin, right margin)
Fig. 4. a.) Semi-synthetic image showing roses behind a wooden board; the initial error
is reduced by more than 99%. b.) During the process the uncertainty of the curve is
reduced and the vague side assignments $a_v$ become certain.



the local uncertainty in the image. This leads to an individually adapted scale
selection for each pixel and thereby to a large area of convergence, see [7]. In step
2b, the covariance $\Sigma_\Phi$ of the estimate $m_\Phi$ is updated based on the Hessian of
the resulting objective function.
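
The role of the prior and of the Hessian can be made explicit with a short derivation. It is a sketch that assumes the objective of step 2a has the form $\chi^2 = -2\ln[\text{likelihood} \cdot \text{prior}]$ with a Gaussian prior, and that the covariance is obtained from a second-order (Laplace-type) approximation around the minimum $\hat m_\Phi$; the symbols $\hat m_\Phi$ and $H$ are introduced here for illustration, and the exact update is given in the companion paper [7].

```latex
\chi^2(m_\Phi)
  = -2\ln p\bigl(I_{\mathcal V} = I_{\mathcal V}^{*} \mid a_{\mathcal V}(m_\Phi,\Sigma_\Phi), S_{\mathcal V}\bigr)
    + (m_\Phi - m_\Phi^{*})^{\top} (\Sigma_\Phi^{*})^{-1} (m_\Phi - m_\Phi^{*})
    + \ln\det\bigl(2\pi\,\Sigma_\Phi^{*}\bigr)

\chi^2(m_\Phi) \approx \chi^2(\hat m_\Phi)
    + \tfrac{1}{2}\,(m_\Phi - \hat m_\Phi)^{\top} H \,(m_\Phi - \hat m_\Phi)
\quad\Longrightarrow\quad
\exp\bigl(-\tfrac{1}{2}\chi^2(m_\Phi)\bigr) \propto
\exp\bigl(-\tfrac{1}{2}\,(m_\Phi - \hat m_\Phi)^{\top} \tfrac{H}{2}\,(m_\Phi - \hat m_\Phi)\bigr),
\qquad
\Sigma_\Phi \approx 2\,H^{-1}
```

The quadratic prior term penalizes estimates far from $m_\Phi^*$, and a sharply curved objective (large $H$) yields a small covariance, i.e. a contracted curve density.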

4 Experiments 

In our experiments we apply the proposed CCD algorithm to the segmentation of 
two fundamental types of image features, namely (i) lines which are radially dis- 
torted and (ii) circles. For the sake of a ground truth we first use semi-synthetic 
images. From two images one combined image is obtained by taking for one side 
of the curve the content of image one and for the other side of the curve the 
content of image two. For pixels on the curve the pixel data are interpolated. 
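
The construction of such a semi-synthetic image can be sketched as follows. The example assumes grey-value images stored as nested lists and a circular curve, and it interpolates boundary pixels by the fraction of the pixel area lying inside the circle, estimated by supersampling. The interpolation rule and the names `inside_fraction` and `compose` are assumptions made for this illustration; the paper does not specify how the interpolation is carried out.

```python
def inside_fraction(px, py, cx, cy, radius, samples=4):
    """Fraction of the unit pixel with corner (px, py) inside the circle."""
    hits = 0
    for i in range(samples):
        for j in range(samples):
            x = px + (i + 0.5) / samples
            y = py + (j + 0.5) / samples
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                hits += 1
    return hits / (samples * samples)


def compose(image_one, image_two, cx, cy, radius):
    """Take image_one inside the circle, image_two outside, blend on the curve."""
    height, width = len(image_one), len(image_one[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            a = inside_fraction(x, y, cx, cy, radius)
            row.append(a * image_one[y][x] + (1.0 - a) * image_two[y][x])
        out.append(row)
    return out
```

Because the curve parameters (`cx`, `cy`, `radius`) are chosen by the experimenter, the ground truth is known exactly, which is what makes the error figures reported below measurable.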

(i) Lines: Fig. 4 row a.) shows such a semi-synthetic image. For different 
iterations the estimated curves are superimposed on the image. During the pro- 
cess the initial error is reduced by more than 99%. Fig. 5 contains real image 
data. The mug has strong internal edges. Shading causes additional variations 
of the mug’s color values. The background contains areas of texture as well as
strong variations of the illumination. After 11 iterations the estimated curve is 
aligned to the real curve without any visible deviation. 

(ii) Circles: Fig. 6 depicts the iteration for a semi-synthetic image. The
initial error is reduced by more than 99.8% and the final error is less than 5% of
a pixel. However, for real images with similarly complex content we assume that
Fig. 5. Real image (sub-image of Fig. 1a); the initial error of the curve is more than
70 pixels (upper part). After 11 iterations the deviation between the estimated curve
and the real curve is not visible.




Fig. 6. The circle and the background are very inhomogeneous. Furthermore, the circle 
is only partially visible. Nevertheless, the initial error is reduced by more than 99.8%. 



the subpixel accuracy is lower due to different effects, such as unknown blurring 
caused by the imaging device. 

The run time of the algorithm scales roughly linearly with the number of pixels
used. Hence, the initial uncertainty has an important impact on the run time.
For example, on a 500 MHz computer an iteration step using 10,000 pixels (high
initial uncertainty) takes about 4 s. After about 5 iterations the run time is
reduced to less than 1 s per iteration.



5 Conclusion 

We have proposed a novel method for fitting deformable models to image data. 
The method iteratively refines the a priori distribution of the model parame- 
ters to a Gaussian approximation of the posterior distribution. Locally adapted 
criteria are used in order to distinguish the two sides of the surmised curve. 
The separation criteria are learned from local statistics of the curve’s vicinity. 
Pixels which are intersected by the curve are modeled by a mixture of two local 
distributions corresponding to the two sides of the curve. This locally adapted
statistical modeling allows the two sides of the curve to be separated with high
subpixel accuracy even in the presence of texture, shading, clutter, and partial
occlusion.






High robustness, i.e. a large area of convergence, is achieved by optimizing the
resulting MAP criteria not only for a single vector of model parameters but
for a distribution of model parameters. During the process not only the model
parameters are refined but also the associated covariance. From the covariance
the local uncertainties of the curve are obtained, which provide the basis for the
automatic and local scale selection. In future work we will apply the proposed
CCD algorithm to high-level model-fitting problems such as 3-D pose estimation
and 3-D reconstruction.
Abstract. In this paper we present SIMBA, a content based image
retrieval system performing queries based on image appearance. We
sider absolute object positions irrelevant for image similarity here and 
therefore propose to use invariant features. Based on a general construc- 
tion method (integration over the transformation group), we derive in- 
variant feature histograms that catch different cues of image content: 
features that are strongly influenced by color and textural features that 
are robust to illumination changes. By a weighted combination of these 
features the user can adapt the similarity measure according to his needs, 
thus improving the retrieval results considerably. The feature extraction 
does not require any manual interaction, so that it might be used for 
fully automatic annotation in heavily fluctuating image databases. 



1 Introduction 

Content based image retrieval has become a widespread field of research. This is 
caused by the ever increasing amount of available image data (e.g. via the Inter- 
net), which requires new techniques for efficient access. A manual annotation of 
the images is too time-consuming and cannot keep up with the growth of data. 
Therefore knowledge from computer vision is exploited to adapt algorithms for 
this task. 

A good overview of content based image retrieval is given in [?]. The features
used can be roughly classified into: color with/without layout, texture, shape,
combinations thereof, and motion parameters.

The two feature types considered in SIMBA are color and texture. The
simplest systems use a color histogram [2], in which, however, all spatial layout
information is lost. In [?] and [?] the histogram is therefore refined using layout
or smoothness information. Other systems perform a segmentation of the image
and compare images based on their features and their absolute or relative
locations [?]. The segmentation, however, is a crucial problem when considering
images of general content. Other authors therefore explicitly restrict themselves
to specialized applications that allow for a robust segmentation, like images of
museum objects in front of a homogeneous background [?].

Texture is often considered in combination with color. Features are diverse,
e.g. granularity, contrast, and directionality are used in [2], whereas extended
cooccurrence matrices and features like periodicity, directionality, and
randomness are considered in [?].

Within this paper invariant features are developed to characterize images
independently of absolute object positions. These features are thus well suited
for image retrieval, as the user will generally consider images similar even if the
objects have moved. In general, objects are subject to quite complex
transformations when projected into an image plane. We are therefore restricted to
approximating the real transformations by transformations that can be mathematically
treated with reasonable effort. In [?], general methods for the construction of
invariant features are explained. Here we focus on invariant features for the group
of translations and rotations (Euclidean motion). In addition to their theoretical
invariance to global translation and rotation, these features have proven to be
robust to independent motion of objects, different object constellations,
articulated objects and even to topological deformations. The method does not require
error-prone preprocessing of the data (like segmentation) but can be applied
directly to the original image data. It is of linear complexity which, however, can be
reduced to constant complexity using a Monte-Carlo estimation of the features [?].

In section 2 we briefly summarize the construction of invariant feature
histograms. We construct two different kinds of features: one mainly considering
color, and a second one considering brightness independently of usual
illumination changes. Based on these features we present a system for content based
image retrieval called SIMBA in section 3. Then in section 4 we show results of
queries performed with SIMBA. Finally a conclusion is given in section 5.

2 Construction of Invariant Features 

We construct invariant features for images $\mathbf{M}$ of size $M \times N$ by integrating over
a compact transformation group $G$ [?]:

$A[f](\mathbf{M}) := \frac{1}{|G|} \int_G f(g\mathbf{M}) \, dg.$ (1)

This invariant integral is also known as Haar integral. For the construction of
rotation and translation invariant grey scale features we have to integrate over
the group of translations and rotations:

$A[f](\mathbf{M}) = \frac{1}{2\pi N M} \int_{t_0=0}^{N} \int_{t_1=0}^{M} \int_{\varphi=0}^{2\pi} f(g(t_0, t_1, \varphi)\mathbf{M}) \, d\varphi \, dt_1 \, dt_0.$ (2)

Because of the discrete image grid, in practice the integrals are replaced by sums,
choosing only integer translations and varying the angle in discrete steps, with
bilinear interpolation applied to pixels that do not coincide with the image grid.

For kernel functions $f$ of local support the calculation of the integral generally
can be separated into two steps: First, for every pixel a local function is evaluated,
which only depends on the grey scale intensities in a neighborhood disk according
to $f$. Then all intermediate results of the local computations are summed up.
Instead of summing up the results, a histogram can be calculated, thus
keeping more information about the local results. Furthermore, different kernel
functions $f$ or multiple color layers can be combined into multidimensional (joint)
histograms. Instead of using a traditional histogram we apply a fuzzy histogram
[?] to get rid of the discontinuous behavior of the traditional histogram.

Kernel functions of monomial type (like $f(\mathbf{M}) = \mathbf{M}(4,0) \cdot \mathbf{M}(0,8)$) in
combination with fuzzy histograms have been successfully applied to the task of
texture classification [?], texture defect detection [?], and content based image
retrieval [?].
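
The two-step evaluation for a monomial kernel can be sketched as follows. The sketch replaces the integrals of equation (2) by sums over integer translations and a fixed number of angles, uses bilinear interpolation for off-grid positions, and averages the local results. Treating the image as cyclic at its borders, the default of 16 angles, and all function names are assumptions made for this illustration.

```python
import math


def bilinear(img, x, y):
    """Sample a grey-value image at real-valued (x, y) with cyclic borders."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def px(ix, iy):
        return img[iy % h][ix % w]

    top = (1 - fx) * px(x0, y0) + fx * px(x0 + 1, y0)
    bottom = (1 - fx) * px(x0, y0 + 1) + fx * px(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom


def local_monomial(img, x, y, n_angles=16, p1=(4.0, 0.0), p2=(0.0, 8.0)):
    """Step 1: average M(p1) * M(p2) over rotations of the pattern around (x, y)."""
    total = 0.0
    for k in range(n_angles):
        phi = 2.0 * math.pi * k / n_angles
        c, s = math.cos(phi), math.sin(phi)
        a = bilinear(img, x + c * p1[0] - s * p1[1], y + s * p1[0] + c * p1[1])
        b = bilinear(img, x + c * p2[0] - s * p2[1], y + s * p2[0] + c * p2[1])
        total += a * b
    return total / n_angles


def invariant_feature(img, n_angles=16):
    """Step 2: average the local results over all pixel positions."""
    h, w = len(img), len(img[0])
    local = [local_monomial(img, x, y, n_angles) for y in range(h) for x in range(w)]
    return sum(local) / len(local)
```

Collecting the values returned by `local_monomial` in a histogram instead of averaging them gives the histogram variant described above; the averaging version is shown because it makes the invariance easy to check numerically.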

In this paper we additionally consider textural features that are invariant to
standard illumination changes. These are constructed using the method above
and applying a relational kernel function of type

$f(\mathbf{M}) = \mathrm{rel}(\mathbf{M}(x_1, y_1) - \mathbf{M}(x_2, y_2))$ (3)

with the ramp function

$\mathrm{rel}(\delta) = \begin{cases} 1 & \delta < -\epsilon \\ \frac{1}{2\epsilon}(\epsilon - \delta) & -\epsilon \le \delta \le \epsilon \\ 0 & \epsilon < \delta \end{cases}$ (4)

centered at the origin.

Our relational kernel function is motivated by the local binary pattern (LBP).
LBP is a well known method for the construction of textural features [?]. This
operator maps the relations between the grey scale values of a center pixel and its
3×3 neighborhood pixels onto a binary pattern. The most important property of
the LBP is its invariance against strictly increasing grey scale transformations.
If we set $\epsilon$ to zero we get exactly the comparison function used for the LBP
operator.

The problem with a simple comparison, as done by the LBP operator, is the
discontinuity of the operator. In the worst case, a small amount of noise within the
image results in a big deviation of the feature. If we add a continuous transition
($\epsilon > 0$) we become more robust to noise, but lose a bit of invariance against
monotonic grey scale transformations.
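
The effect of the transition width can be seen in a minimal sketch of the ramp function. The function name `rel` follows the text; returning 0.5 for a zero difference when the width is zero is a tie-breaking convention chosen here, not something the paper specifies.

```python
def rel(delta, eps):
    """Ramp function: 1 below -eps, 0 above eps, linear in between (eps >= 0)."""
    if delta < -eps:
        return 1.0
    if delta > eps:
        return 0.0
    if eps == 0:
        # Only reached for delta == 0; the value at the jump is a convention.
        return 0.5
    return (eps - delta) / (2.0 * eps)


def relational_kernel(img, p1, p2, eps):
    """Compare the grey values at two pixel positions given as (x, y) tuples."""
    (x1, y1), (x2, y2) = p1, p2
    return rel(img[y1][x1] - img[y2][x2], eps)
```

With a width of zero, a grey value difference that noise moves from -0.5 to 0.5 flips the result from 1 to 0, whereas with a width of 5 the result only changes from 0.55 to 0.45.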

Again, equation (2) can be evaluated in two steps: The chosen kernel function
$f$ defines a pattern of two pixels that have to be compared with the rel-function.
This pattern is rotated around each image pixel (local evaluation) and then
translated to all pixel positions (global averaging). Again it is natural to compute
a histogram instead of the average. We thus create a 3-bin histogram for
every local evaluation first. Roughly speaking, this gives for every image pixel
the fractions of pixel pairs with relation ’darker’, ’equal’, and ’lighter’. By using
the fuzzy histogram mentioned above we keep smooth transitions between these
three classes. Afterwards an overall joint histogram of these local fractions is
constructed from all local evaluations.

As a result we obtain a feature vector characterizing the local activities within 
the image, which is invariant to image translations and rotations and robust to 
standard brightness transformations. 
3 SIMBA, an Online Prototype for Content Based Image
Retrieval

Based on the methods presented above we implemented SIMBA, an appearance
based image retrieval system containing nearly 2500 photographic images.
The user can combine features that are strongly influenced by the image color
(monomial type kernel functions evaluated on each color layer) with the
proposed textural features that are calculated from the luminance layer only. By
assigning weights to these features the user can adapt his query according to his
needs or according to the image content.

SIMBA is constructed as a client-server system, thus providing faster query
performance and the possibility of data protection. The search client, which
can be located on any computer on the Internet, analyzes the query image and
only sends the extracted features to the database server. The query image stays
private as it does not leave the client’s computer. The database server holds
pre-calculated features of all database images. When a query request from a client
arrives, it performs a weighted nearest neighbor query (in terms of a histogram
intersection) and returns the resulting image names (or URLs) and additional
information like the match value to the client.
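
A weighted nearest neighbor query based on histogram intersection can be sketched as follows. The data layout (one histogram per feature type and image), the feature names `"color"` and `"texture"`, and the function names are assumptions made for this illustration and do not describe the actual SIMBA server.

```python
def intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima (larger means more similar)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def match_value(query, entry, weights):
    """Weighted sum of the intersections of all feature histograms."""
    return sum(w * intersection(query[name], entry[name]) for name, w in weights.items())


def search(query, database, weights, top_k=3):
    """Return the top_k (image name, match value) pairs, best match first."""
    scored = [(match_value(query, feats, weights), name) for name, feats in database.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


if __name__ == "__main__":
    database = {
        "a.jpg": {"color": [0.5, 0.5, 0.0], "texture": [1.0, 0.0]},
        "b.jpg": {"color": [0.0, 0.5, 0.5], "texture": [0.0, 1.0]},
    }
    query = {"color": [0.5, 0.5, 0.0], "texture": [0.0, 1.0]}
    print(search(query, database, {"color": 1.0, "texture": 0.0}))
    print(search(query, database, {"color": 0.0, "texture": 1.0}))
```

Only the feature histograms of the query are needed on the server side, which matches the statement above that the query image itself never leaves the client.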

Currently SIMBA runs on an SGI O2 MIPS R5000 (web client) and an IBM
RS6000 (server).¹ It also runs on Windows as well as various other Unix
systems and can be accessed from anywhere on the Internet.

4 Results 

We present results obtained with SIMBA using the MPEG-7 test dataset. As
mentioned before, the monomial features consider color and thus the results are
very intuitive for the user. The texture features based on relational kernel
functions are sometimes less intuitive. This is because we do not use a texture
database containing several examples of the same texture but real world images.
Nevertheless, the user might comprehend the results in terms of
’highly textured image parts’ or ’homogeneous image parts’.

Figures 1 to 3 show example queries.² We contrast results using monomial
and relational feature kernels, or combinations thereof. Note that the examples
were intentionally chosen such that the results are not satisfying when using
monomial kernels only. Often, however, the results using monomial kernels are
already sufficient.

The query image in figure 1 displays a sunset. There is no sunset image in
the database with similar color, thus the result of the monomial kernel functions
is not satisfying (first query in figure 1). Applying relational kernel functions, the
smooth gradient characteristics of different sunsets in the database are caught

¹ http://simba.informatik.uni-freiburg.de

² We acknowledge Tristan Savatier, Alejandro Jaimes, and the Department of Water
Resources, California for providing the displayed images within the MPEG-7 test
set.

Fig. 1. Query examples from SIMBA. The first query is based on monomial feature 
kernels only. The next query uses relational kernels only. The upper centered image of 
each query displays the query template. Below this, the top eight results are shown 
with the corresponding histogram intersection values. 



better (second query in figure 1). The query images in figures 2 and 3 have very
characteristic color and texture. The query results using a monomial kernel only
are not satisfying (first queries of figures 2 and 3). Combining monomials and
relational kernels, however, both color and texture are considered well, so that
several images with steel reinforcement and railings, respectively, are found
(second queries of figures 2 and 3).

Fig. 2. Query examples from SIMBA. The queries displayed at the top are based 
on monomial feature kernels only. The second query was performed using an equally 
weighted combination of monomial and relational kernel functions. The upper centered 
image of each query displays the query template. Below this, the top eight results are 
shown with the corresponding histogram intersection values. 



5 Conclusion 

In this paper we presented a content based image retrieval system based on 
invariant features. A previous system that paid much attention to color was 
extended by novel invariant textural features. By weighting these feature types 
according to the needs or according to the image type, the user is able to improve 

Fig. 3. Query examples from SIMBA. The queries displayed at the top are based 
on monomial feature kernels only. The second query was performed using an equally 
weighted combination of monomial and relational kernel functions. The upper centered 
image of each query displays the query template. Below this, the top eight results are 
shown with the corresponding histogram intersection values. 



the retrieval results considerably. In contrast to many existing image retrieval 
systems SIMBA does not rely on error-prone preprocessing steps (like segmenta- 
tion) but derives its features directly from the original image data. All transfor- 
mations performed are continuous mappings thus ensuring a smooth degradation 
of the features if the image data is changed moderately. As none of the methods 
requires manual interaction, the system might also be used for fully automatic 
generation of annotations in heavily fluctuating image databases. 

Abstract. In this paper we propose a novel method for the construction
of invariant textural features for grey scale images. The textural features
are based on an averaging over the 2D Euclidean transformation group
with relational kernels. They are invariant against 2D Euclidean motion
and strictly increasing grey scale transformations. Besides other fields of
texture analysis applications we consider texture defect detection here.
We provide a systematic method for applying these grey scale features
to this task. This includes the localization and classification of the
defects. First experiments with real textile texture images taken from
the TILDA database show promising results, which are presented in this
paper.



1 Introduction 

Among other basic image primitives, e.g. edges, texture is one important cue
for image understanding. The lack of a universal, formal definition of texture
makes the construction of features for texture analysis hard. Few approaches
exist in the literature on how texture can be qualitatively defined, e.g. [3].

Generally, real world textures are subject to transformations which make the
construction of textural features difficult. These transformations affect the
intensity values, the position, and the orientation of the texture. For texture
analysis purposes textural features have to cope with one or more of these
transformations.

In this paper, we propose a novel method for the construction of textural 
features from grey scale images which are invariant with respect to rotation, 
translation, and strictly increasing grey scale transformations. Our approach 
combines two techniques: one for the construction of invariance with respect to 
2D Euclidean motion and one for the construction of invariance against strictly 
increasing grey scale transformations. 

Besides other fields of texture analysis applications we consider the texture
defect detection (TDD) as an application for our features. The TDD is one im- 
portant topic of automatic visual inspection. Over the last decade the automatic 
visual inspection of technical objects has gained increasing importance for in- 
dustrial application. An extensive review on TDD methods can be found in HH. 
Song et al. classify the methods for TDD into two categories: local and global
ones. According to this classification we provide a local method for the TDD with 
our textural features, including localization (segmentation of defective regions) 
and classification of the defects. 

This paper is organized as follows: In section 2 we introduce the theory of
our textural features. A systematic method for TDD, including localization and
classification, is given in section 3. First experiments and results are presented
in section 4. A conclusion is given in section 5.

2 Constructing Invariant Textural Features 

Schulz-Mirbach proposed a method for the construction of invariant features
$A[f]$ for two-dimensional grey scale images $M$¹ by integrating over a compact
transformation group $G$:

$$A[f](M) := \frac{1}{|G|} \int_G f(gM)\, dg. \qquad (1)$$

The invariant integral is also known as Haar integral. For the construction of
rotation and translation invariant grey scale features we have to integrate over
the 2D Euclidean transformation group:

$$A[f](M) = \frac{1}{2\pi W H} \int_{t_x=0}^{W} \int_{t_y=0}^{H} \int_{\varphi=0}^{2\pi} f\bigl(g(t_x, t_y, \varphi)\, M\bigr)\, d\varphi\, dt_y\, dt_x, \qquad (2)$$

where $\{t_x, t_y, \varphi\}$ denote the transformation parameters. Because of the
discrete image grid, in practice the integrals are replaced by sums, choosing only
integer translations and varying the angle in discrete steps, applying bilinear
interpolation for pixels that do not coincide with the image grid.
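
The replacement of the integrals by sums can be illustrated with a deliberately simplified sketch. It assumes a square image, cyclic (wrap-around) translations, and only the four rotations by multiples of 90 degrees, so that no interpolation is needed. These restrictions, the helper names, and the example kernel are assumptions made for illustration and are not the setting used in the paper.

```python
def rotate90(img):
    """Rotate a square image (list of rows) by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def shift(img, tx, ty):
    """Cyclic translation of the image by (tx, ty) pixels."""
    h, w = len(img), len(img[0])
    return [[img[(y - ty) % h][(x - tx) % w] for x in range(w)]
            for y in range(h)]


def group_average(img, kernel):
    """Average kernel(gM) over all cyclic shifts and 90-degree rotations."""
    h, w = len(img), len(img[0])
    total = 0.0
    rotated = img
    for _ in range(4):                      # discrete rotation angles
        for ty in range(h):                 # integer translations
            for tx in range(w):
                total += kernel(shift(rotated, tx, ty))
        rotated = rotate90(rotated)
    return total / (4 * w * h)


def monomial(m):
    """Kernel of local support: product of two neighboring grey values."""
    return m[0][0] * m[0][1]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
feature = group_average(image, monomial)
feature_rotated = group_average(rotate90(image), monomial)
feature_shifted = group_average(shift(image, 1, 2), monomial)
```

Because the transformations used here form a finite group, the three feature values agree up to rounding, which is the discrete counterpart of the invariance of the Haar integral.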

For kernel functions $f$ of local support the calculation of the integral
generally can be separated into two steps: First, for every pixel a local
function $f$ is evaluated, which only depends on the grey scale intensities in a
specific neighborhood.² Then all intermediate results of the local computations
are summed up.

An alternative approach for the second step of the computation strategy 
is the usage of fuzzy histograms [Q. The fuzzy histogram (FZH) was created 
mainly to get rid of the discontinuous behavior of the traditional histogram. 
The combination of invariant integration with monomials as kernel functions and 
the FZHs has been successfully applied to the task of texture classification cni, 
texture defect detection [3, and content based image retrieval |S|. 
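
A fuzzy histogram can be sketched as below. The triangular weighting, in which each value is shared linearly between the two nearest bin centers, is an assumption chosen for illustration; the cited FZH may use a different membership function. The function name and the normalization are likewise hypothetical.

```python
def fuzzy_histogram(values, centers):
    """Normalized histogram with linear (triangular) bin membership.

    Each value contributes to its two neighboring bin centers with weights
    that change continuously, so small changes of a value cause only small
    changes of the histogram.
    """
    hist = [0.0] * len(centers)
    for v in values:
        v = min(max(v, centers[0]), centers[-1])   # clamp to the bin range
        for i in range(len(centers) - 1):
            lo, hi = centers[i], centers[i + 1]
            if lo <= v <= hi:
                w = (v - lo) / (hi - lo)
                hist[i] += 1.0 - w
                hist[i + 1] += w
                break
    n = len(values)
    return [h / n for h in hist] if n else hist


centers = [0.0, 0.5, 1.0]
halfway = fuzzy_histogram([0.25], centers)
mixed = fuzzy_histogram([0.0, 0.5, 1.0, 0.75], centers)
```

In contrast to a traditional histogram, a value of 0.25 is not forced into a single bin but contributes equally to the bins centered at 0.0 and 0.5.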

¹ The grey scale image is defined as $M : (x, y) \mapsto [0, D]$, $(x, y) \in [0, W-1] \times [0, H-1]$,
where W and H denote the width and height, and D denotes the maximum grey
scale value of the image.

² The neighborhood is on circles around the currently processed pixel, whereby the
radius of the circle is defined by $f$.

However, by using absolute grey scale values for the construction of the
invariant feature histograms we are susceptible to inhomogeneous illumination
conditions. A local approach for the elimination of illumination influences is
the construction of features with local support which are invariant, or at least
robust, against strictly increasing grey scale transformations.

In general we define a relational function rel() which maps the difference $\delta$
of two grey scale values to a real number. One simple relational function and its
plot are shown in Fig. 1. The grey scale difference $\delta$ of two neighboring pixels is
mapped to the interval [0, 1].

Fig. 1. The definition of the relational function rel() and its plot, which maps the grey
scale difference $\delta$ to [0, 1]. D denotes the maximum grey scale value.



The relational kernel function (RKF) is motivated by the local binary pattern
(LBP). LBP is a well-known method for the construction of textural features [3].
This operator maps the relations between the grey scale values of a center pixel
and its 3 × 3 neighborhood pixels onto a binary pattern. The most important
attribute of the LBP is the invariance against strictly increasing grey scale
transformations. However, the LBP feature is not rotation invariant. If we shift
the threshold $\epsilon$ of rel() towards zero we get exactly the comparison function
used for the LBP.

A simple comparison of adjacent pixels as done by the LBP operator results
in an unstable feature. In the worst case, a small perturbation
of the image results in a large perturbation of the feature. If we add robustness
against noise ($\epsilon > 0$) we lose invariance but keep robustness against strictly
increasing grey scale transformations. Since we are not using only the values "0"
and "1", as the LBP operator does, all $\delta$ inside the interval
$[-\epsilon, \epsilon]$ lead to a continuous result of rel().
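
A relational function with these properties can be sketched as a clipped linear ramp. The exact definition is given only in Fig. 1, which is not reproduced in the text, so the piecewise form below (1 for $\delta < -\epsilon$, 0 for $\delta > \epsilon$, linear in between) is an assumption that merely matches the described behavior.

```python
def rel(delta, eps):
    """Map a grey scale difference delta to [0, 1] (assumed ramp form).

    Differences below -eps map to 1, differences above eps map to 0, and
    the values in between are interpolated linearly, so the result is
    continuous in delta.
    """
    if eps <= 0:
        raise ValueError("eps must be positive in this sketch")
    if delta < -eps:
        return 1.0
    if delta > eps:
        return 0.0
    return (eps - delta) / (2.0 * eps)


eps = 5
darker = rel(-40, eps)     # clearly negative difference
equal = rel(0, eps)        # no difference
lighter = rel(40, eps)     # clearly positive difference
```

For very small $\epsilon$ the ramp approaches a step function, which corresponds to the hard comparison mentioned above for the LBP.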

Now we formulate the combination of the feature construction technique for
rotation and translation invariance with the RKFs. Let $p_i = \{r_i, \phi_i\}$, $i = 1, 2$
define a parameter set of two circles, where $r_i$ denotes the radius and
$\phi_i \in [0, 2\pi[$ defines the phase. We assume a sufficiently high sampling
rate that is equal for all discrete circles. A sufficient heuristic definition of the
sample rate is $s = 8 \lceil \max_i \{r_i\} \rceil$. Samples which do not fall onto the image grid
must be interpolated, e.g. bilinearly.
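
The sampling of a circle can be sketched as follows. The function names, the clamping at the image border, and the direction in which the samples are enumerated are assumptions made for this illustration; only the sample rate heuristic and the bilinear interpolation are taken from the text.

```python
import math


def sample_rate(radii):
    """Heuristic sample rate s = 8 * ceil(max_i r_i)."""
    return 8 * math.ceil(max(radii))


def bilinear(img, x, y):
    """Bilinearly interpolate img (list of rows) at real-valued (x, y).

    Coordinates outside the image are clamped to the border (assumption).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom


def sample_circle(img, cx, cy, radius, phase, s):
    """Return s interpolated grey values on a circle around (cx, cy)."""
    samples = []
    for j in range(s):
        angle = phase + 2.0 * math.pi * j / s
        samples.append(bilinear(img,
                                cx + radius * math.cos(angle),
                                cy + radius * math.sin(angle)))
    return samples


flat = [[3.0] * 7 for _ in range(7)]
s = sample_rate([1, 2])
values = sample_circle(flat, 3, 3, 2.0, math.pi / 8, s)
```

With the radii used later in the experiments, the heuristic gives s = 80 for a maximum radius of 10 and s = 120 for a maximum radius of 15.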

The sampling of two circles and the computation of the rel() function are
explained in Fig. 2. The functions $m_{p_1}$ and $m_{p_2}$ denote the interpolated grey
scale values on the circles. On all grey scale samples of the circles the difference





M. Schael 





Fig. 2. On the left side two different circles with the parameters $p_1 = \{1, \pi/8\}$ and
$p_2 = \{2, 7\pi/8\}$ and the sample rate $s = 16$ are plotted. The upper and the middle
diagram on the right show the corresponding grey scale values of the circles, where
$j = 1, \ldots, 16$ denotes the sample number. All differences $\delta(j) := m_{p_1}(j) - m_{p_2}(j)$ are
mapped with the rel() function to the interval [0, 1]. D denotes the maximum grey
scale value.



$\delta$ is computed. Each grey scale difference is mapped by the function rel() to a
value in the interval [0, 1]. As a result we obtain a measure for the grey
scale relations in the neighborhood of the center pixel.

To achieve invariance against rotation we estimate a FZH on all local rel() 
results. For each pixel we get a FZH with three bins (with bin centers at 0.0, 
0.5, and 1.0). I.e., for each RKF rel() we get three feature maps of the size of 
the input image minus the support of the RKF. These feature images can be 
combined by using a three-dimensional joint feature distribution (FD). 



3 Texture Defect Detection 

We consider a planar textured surface of technical objects of a known texture 
class (TC), e.g. textile surfaces, and restrict to periodical textures which can be 
anisotropic. The texture is disturbed by defects of different classes. Orientation 
and position of the texture and the defects are unknown. Since the distance 
between the camera and the surface is fixed, we do not need to consider scaling. 
Furthermore, camera plane and surface are parallel. Thus, there are no projective 
transformations. 

Our method for the texture defect detection is based on the comparison of 
FDs estimated from the local results of the RKFs. This approach is based on the 
assumption that a defective region of the texture results in a significant deviation 
of the FD. In a first step of the supervised training an average reference FD of 
the defect-free class (the reference class) is estimated. For this step we need a
set of defect-free sample images. From all FD results of the samples we compute
the average FZH.

For the localization of the defective regions (segmentation of the defects)
on the texture we are working with subimages. Depending on the scale of the
texture and the size of the texels (texture elements) a subimage size is chosen.
Each sample texture is divided into non-overlapping patches. Inside all patches
we estimate the three-dimensional FZHs of the local RKF results. Using an FD
distance measure, e.g. $\chi^2$, the similarity between the reference FD and the patch
FD is calculated. Based on these FD distances we obtain a threshold distance $\theta$
which allows the discrimination between the reference and the defect state.
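
The patch partitioning and the distance computation can be sketched as follows. The symmetric form of the $\chi^2$ statistic used here (each histogram compared with the bin-wise mean of both) is one common definition and is an assumption, as are the function names. The four partitions shifted by half a patch follow the footnote on image partitions.

```python
def patch_origins(width, height, size, offset=(0, 0)):
    """Top-left corners of non-overlapping size x size patches."""
    ox, oy = offset
    return [(x, y)
            for y in range(oy, height - size + 1, size)
            for x in range(ox, width - size + 1, size)]


def four_partitions(width, height, size):
    """Four partitions whose start positions are shifted by half a patch."""
    half = size // 2
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    return [patch_origins(width, height, size, off) for off in offsets]


def chi_square(p, q):
    """Symmetric chi-square statistic between two normalized histograms."""
    dist = 0.0
    for a, b in zip(p, q):
        mean = 0.5 * (a + b)
        if mean > 0:
            dist += (a - mean) ** 2 / mean
    return dist


partitions = four_partitions(256, 256, 32)
reference_fd = [0.5, 0.5, 0.0]
patch_fd = [0.9, 0.1, 0.0]
distance = chi_square(patch_fd, reference_fd)
```

How the threshold $\theta$ is derived from the distances of the defect-free patches is not specified in the text, so it is left out of this sketch.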

The second step of the training phase comprises the estimation of the
defect models. Again we have to compute the FD for the patches of all defect
sample images.³ A defect is detected if the distance between the patch FD and
the reference FD is significant. Since the training is supervised, we are
able to use the defective patches of a sample image to estimate a defect model
FD for this class.

After the training phase we have a model FD of each class. In the following
test phase a texture image of unknown state is analyzed. The FDs of all
subimages, according to the four different image partitions, are estimated and
compared to the reference FD. All $\chi^2$ distances between the subimage FDs and the
reference FD are computed. A distance greater than the threshold $\theta$ indicates
a defective subimage. The classification of the subimage is done by a minimum
distance classifier. Thus, the minimum distance to a defect model FD assigns
the defect class. Using different RKFs allows the adaptation of the method to
different defects.
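
The decision rule of the test phase can be sketched as below. The model FDs, the threshold value, and the function names are made up for illustration, and the symmetric $\chi^2$ form is again an assumed definition of the distance.

```python
def chi_square(p, q):
    """Symmetric chi-square statistic between two normalized histograms."""
    dist = 0.0
    for a, b in zip(p, q):
        mean = 0.5 * (a + b)
        if mean > 0:
            dist += (a - mean) ** 2 / mean
    return dist


def classify_patch(patch_fd, reference_fd, defect_models, theta):
    """Return None for a defect-free patch, else the closest defect class."""
    if chi_square(patch_fd, reference_fd) <= theta:
        return None
    # Minimum distance classifier over the defect model FDs.
    return min(defect_models,
               key=lambda name: chi_square(patch_fd, defect_models[name]))


reference_fd = [0.5, 0.5, 0.0]
defect_models = {"E1": [1.0, 0.0, 0.0], "E2": [0.0, 0.0, 1.0]}  # toy models
theta = 0.05                                                    # toy threshold

label_ok = classify_patch([0.5, 0.5, 0.0], reference_fd, defect_models, theta)
label_defect = classify_patch([0.9, 0.1, 0.0], reference_fd, defect_models, theta)
```

In this toy setting the second patch deviates from the reference by more than the threshold and lies closest to the model of class E1.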



4 Experimental Results 

First experiments were made with real textile images taken from the TILDA⁴
database. Two different TCs (C1R1 and C2R2), four defect classes (E1 to
E4) and a defect-free reference class (E0) were used. The defect classes consist
of E1: holes, E2: oil spots, E3: thread errors and E4: objects on the surface.

As described in section 3 the training phase was conducted first. The reference
FD model for both TCs was estimated with ten image samples. Thereafter, the 
localization of the defective image regions was performed on the defect images. 

³ To ensure that the deviation of the subimage FD to the reference FD is significant
we use four different partitions of the image. For each partition the subimage size is
equal but the start positions are shifted by half the width or height of the subimage
size, respectively.

⁴ The textile texture database was developed within the framework of the working
group Texture Analysis of the DFG’s (Deutsche Forschungsgemeinschaft) major
research program “Automatic Visual Inspection of Technical Objects”. The database
can be obtained on two CD-ROMs on order at the URL
http://lmb.informatik.uni-freiburg.de/tilda.

Fig. 3. Segmented defects of texture class C1R1.



Examples of the localized defective regions on the images of both TCs are shown
in Fig. 3 and Fig. 4. In both figures the defect classes from left to right are E1
to E4. The upper row of both figures shows the selected defect images of size
256 × 256. Chosen patch sizes for the TCs are: C1R1: 32 × 32, and C2R2: 40 × 40.
Used parameter sets for the RKF: C1R1: $p_1 = \{2, 0\}$, $p_2 = \{10, 0\}$, $s = 80$, and
C2R2: $p_1 = \{2, 0\}$, $p_2 = \{15, 0\}$, $s = 120$. The threshold $\epsilon = 5$ was the same for
both TCs.




Fig. 4. Segmented defects of texture class C2R2. 

At this stage of the experiments no FDs for the defect classes were estimated.
However, we have computed the inter-class FD distances of the defective image
regions by using the Jensen-Shannon (JS) divergence, also known as Jeffrey
divergence. This divergence measure was empirically analyzed by Puzicha
et al., who achieve good color and texture classification results with the $\chi^2$ and the
JS measure. Figure 5 shows the JS distances of both analyzed TCs C1R1 and
C2R2.
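
For reference, the JS divergence between two normalized FDs can be computed as in the following sketch. The factor 1/2 in front of each term and the use of the natural logarithm are conventions assumed here; other formulations differ by a constant factor. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence; bins with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Symmetric divergence of p and q to their bin-wise mean."""
    mean = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mean) + 0.5 * kl_divergence(q, mean)


fd_a = [0.7, 0.2, 0.1]
fd_b = [0.1, 0.3, 0.6]
js_ab = jensen_shannon(fd_a, fd_b)
js_ba = jensen_shannon(fd_b, fd_a)
js_max = jensen_shannon([1.0, 0.0], [0.0, 1.0])
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric and stays finite when one of the histograms has empty bins.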




Fig. 5. FD distances between the defect classes E1 to E4 measured with the JS
divergence. On the left: results of C1R1, on the right: results of C2R2.



All inter-class FD distances of both texture classes indicate the good discrim- 
inative properties of the features. Thus, the classification of the defects using the 
proposed textural features should be possible. 

5 Conclusion 

In this paper we presented a novel technique for the construction of textural 
features which are invariant with respect to 2D Euclidean motion and robust 
against usual grey scale transformations. A systematic method for the applica- 
tion of the features to the texture defect detection task was given. It includes the 
localization and classification of the defects. First experiments with our features 
and the TILDA database show promising results. 

The size of the detectable defects depends on the chosen patch size. Thus, the
scale of the texture and the patch size must be properly selected. If the spatial
extension of a defect is smaller than the patch size, the deviation of the FD
might be too small for a detection. However, all defects of the considered texture
images from TILDA could be detected. Using suitable RKFs allows the adaptation
of the method to different defect and texture classes.

In a next step the experiments with the TILDA database must be extended. 
The proposed textural features are also applicable to the area of content based 
image retrieval and texture classification. 

Abstract. We describe a method to decompose binary 2-D shapes into
layers of overlapping parts. The approach is based on a perceptual group- 
ing framework known as tensor voting which has been introduced for 
the computation of region, curve and junction saliencies. Here, we dis- 
cuss extensions for the creation of modal/amodal completions and for 
the extraction of overlapping parts augmented with depth assignments. 
Advantages of this approach are from a conceptual point of view the close 
relation to psychological findings about shape perception and regarding 
technical aspects a reduction of computational costs in comparison with 
other highly iterative methods. 



1 Introduction 

The phenomenon of spontaneously splitting figures (Fig. 1) has been discussed 
in several psychological studies as an example for the existence of perceptual 
organization in human shape perception. Kellman and Shipley [7], who present 
a psychologically motivated theory of unit formation in object perception, note 
that demonstrations have been easier to develop than explanations. 

With respect to computer vision, a method for part decompositions studied 
on spontaneously splitting figures is considered to yield a natural solution for the 
hard problem of object recognition in the presence of occlusions [1,6, 11, 15-17]. 
However, aiming at the decomposition of the input image into disjoint partitions,
other segmentation schemes do not provide adequate means for the handling 
of occlusions and often yield intuitively implausible parts - among them the 
symmetric axis transform by [2,3], the transversality principle by [6], process- 
grammars by [10], and evolving contours by [8,9]. A detailed discussion may be 
found in [12]. Here, we intend to merge findings from both disciplines in order 
to develop a method for the extraction of overlapping forms by the application 
of a perceptual grouping algorithm. 



2 The Tensor Voting Approach 

For the development of our unit formation process, we apply different techniques 
from the tensor voting framework, a perceptual saliency theory introduced by 
Medioni et al. [4, 13]. We consider as the main advantage of this approach over 

Fig. 1. Spontaneously splitting figures: Despite the homogeneous surface quality, the
shapes tend to be seen as two overlapping forms. The figure seen on top of the other
is modally completed while the other one is amodally completed behind the nearer
unit. Although the depth ordering may alternate, the units formed remain the same.
a) Example from [7]. b), c) Examples used as input images in further demonstrations.



other methods, like [18-21], that the computation is not iterative, i.e. does not
require iterations which are necessary for regularization, consistent labeling,
clustering, robust methods, or connectionist models. Furthermore, tensor voting does
not require manual setting of starting parameters and is highly stable against 
noise while preserving discontinuities. Multiple curves, junctions, regions, and 
surfaces are extracted simultaneously without the necessity to define an object 
model. 

To explain the principles of the tensor voting method, let us first consider the
basic type of input data, which are oriented tokens, edgels $e = [t_x, t_y]^T$,
given in a sparse map. In contrast to other vector-based methods, e.g. [5, 14,
21], the tokens are encoded as second-order symmetric tensors
$T = e e^T = \begin{bmatrix} t_x^2 & t_x t_y \\ t_x t_y & t_y^2 \end{bmatrix}$,
which in the 2-D case can be interpreted as the covariance
matrix of an ellipse. Thus, each token can simultaneously carry two parameters:
the degree of orientation certainty encoded by the elongation of the ellipse (i.e.
a circle represents maximal uncertainty of orientation, which is a feature of a
junction) and a saliency measure encoded by the size of the ellipse. Given edgels
as input tokens, initially all tensors have maximal elongation and look like sticks.
Grouping between the sparse input tokens is deduced from the information
propagated by each token into its neighborhood. This interaction is encoded
into the layout of an appropriate tensor field, here the stick field depicted in
Fig. 2a, which is placed at each input location after a rotational alignment.
Then, all fields are combined by a tensor addition. Hence, this voting operation
is similar to a convolution. The resulting general tensors carry information about
curve saliencies and junction saliencies which are computed by the decomposition
$T = (\lambda_1 - \lambda_2)\, e_1 e_1^T + \lambda_2 (e_1 e_1^T + e_2 e_2^T)$, where $\lambda_1 \ge \lambda_2 \ge 0$ denote the eigenvalues
of $T$ and $e_1, e_2$ the corresponding normalized eigenvectors. The first term of the
sum represents a stick tensor (in direction of the curve orientation) weighted by
the curve saliency $\lambda_1 - \lambda_2$ while the second term is a ball tensor weighted by the
junction saliency $\lambda_2$. Finally, the result of the perceptual grouping is derived by
extracting the maxima in the saliency maps; for curves a marching algorithm is

Fig. 2. Voting Fields: a) Stick field encoding the information propagated from a
horizontally oriented token (center) into its neighborhood. The field tries to find a
continuation with minimal curvature, comparable to the Gestalt law of good continuation.
It decreases with growing curvature and distance. b) Ball field for non-oriented data.
c) Modified stick field used for voting between junctions.



employed to trace them along their highest saliencies found perpendicularly to 
the orientations of the stick tensors. 
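
The tensor encoding and the saliency decomposition can be illustrated with the following sketch. It covers only the encoding of edgels, the tensor addition, and the eigen-decomposition of a 2 × 2 symmetric tensor; the voting fields themselves are not modeled, and all function names are assumptions.

```python
import math


def stick_tensor(tx, ty):
    """Encode an edgel e = [tx, ty]^T as the tensor e e^T."""
    return [[tx * tx, tx * ty], [tx * ty, ty * ty]]


def add_tensors(a, b):
    """Combine two votes by tensor addition."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def saliencies(t):
    """Return (curve saliency, junction saliency, orientation of e1).

    Uses the closed-form eigenvalues of a symmetric 2 x 2 matrix:
    lambda_1 - lambda_2 weights the stick part, lambda_2 the ball part.
    """
    a, b, d = t[0][0], t[0][1], t[1][1]
    mean = 0.5 * (a + d)
    spread = math.hypot(0.5 * (a - d), b)
    lam1 = mean + spread
    lam2 = max(mean - spread, 0.0)       # guard against rounding below zero
    orientation = 0.5 * math.atan2(2.0 * b, a - d)
    return lam1 - lam2, lam2, orientation


# Two votes with the same orientation reinforce a curve ...
curve = saliencies(add_tensors(stick_tensor(1.0, 0.0), stick_tensor(1.0, 0.0)))
# ... while two orthogonal votes cancel the orientation (junction-like).
junction = saliencies(add_tensors(stick_tensor(1.0, 0.0), stick_tensor(0.0, 1.0)))
```

Agreeing votes produce an elongated tensor with high curve saliency, whereas conflicting votes produce a circular tensor whose weight appears in the junction saliency.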

If the input data consists of non-oriented tokens, i. e. points, it can easily be 
transformed into oriented data by a preceding voting step with a different field, 
called ball field (Fig. 2b), which is built by integrating the stick field over a 180° 
rotation. 



3 Segmentation of Spontaneously Splitting Figures 

Our method for the decomposition of forms into multiple, possibly overlapping 
layers applies mechanisms of the tensor voting framework at different process- 
ing stages as detailed in [12]. First, the binary input images are transformed by 
application of an isotropic ball field into a boundary saliency map from which 
contours (including polarity information) and junctions are extracted by appli- 
cation of a stick field. 

Motivated by several psychological studies (cf. [7] for a survey), we use the 
junctions along the contour as seeds for interpolated edges which connect so- 
called relatable edges to form part groupings. This step is implemented as a 
newly introduced voting between junctions: In search for a continuation of the 
abruptly ending line segments, each junction sends out votes in the opposite 
direction of its two incoming edges. The underlying voting field is defined as a 
semi-lobe of the stick field (Fig. 2c) with a smaller opening angle to emphasize 
straight continuations and to get a reduced position uncertainty compared to 
general input domains. Overlapping fields of interacting junctions yield high 
curve saliencies (Fig. 4b) which are traced by a marching algorithm along the 
local maxima to form amodal and modal completions of part boundaries, in the 
following briefly called virtual contours. The junctions we have used as seeds are 
formed by exactly two crossing edges on the real contour because the input is 
given as a binary image. We will call them $L_0$-junctions, where the index denotes
the number of incoming virtual contours. 

One important type of junctions, the T-junctions, needs special consideration.
These junctions are created by a prior step of junction voting as interactions
of voting fields sent out by $L_0$-junctions with opposite boundary tokens. Then,
within the tensor voting framework, T-junctions are extracted as locations of
high junction saliencies. For each T-junction its stem is formed by the incoming
virtual contour and augmented by region polarity information deduced from
the polarity at the $L_0$-junction by which it has been created. Each T-junction will
vote into the direction of its stem together with the set of $L_0$-junctions which vote
as described before. Hence, after two steps of junction voting, we get all possible
virtual contours as high curve saliencies which are extracted by a marching
algorithm.

For the introduction of a unit formation process, it will be necessary to
distinguish between different kinds of junctions as each type indicates distinct
shape decompositions. After the extraction of the virtual contours, we reinspect
each $L_0$-junction and check whether one or two virtual contours have been generated in
its vicinity. In that case the junction is relabeled to an $L_1$- or $L_2$-junction,
respectively. Then, the T-junctions which are induced by an $L_1$-junction are
relabeled to $T_1$ and those induced by an $L_2$-junction to $T_2$.

The interpretation of an $L_0$-junction is simply no indication for a decomposition
but merely a discontinuity on the contour, whereas an $L_1$-junction gives
evidence of a decomposition into two disjoint parts with the virtual contour
belonging to both outlines. Depending on the part to which it has been assigned,
the virtual contour carries one of two opposite polarity vectors. An $L_2$-junction
could be interpreted as the coincidence of two $L_1$-junctions with each virtual
contour carrying two opposite polarities, thus yielding two disjoint parts (Fig. 3a).
However, it could also be the case that both virtual contours are formed by
two overlapping parts with only one direction of each polarity pair being valid
(Fig. 3b). The discrimination between both decompositions is achieved by the
walking algorithm described below. Similarly, $T_1$-junctions (Fig. 3c) indicate a
decomposition into two disjoint parts sharing the T-stem with opposite polarity
assignments while $T_2$-junctions (Fig. 3d) imply two overlapping parts which
share one half of the T-bar with identical polarity directions.





Fig. 3. The different junction types together with the inferred shape decompositions:
An $L_2$-junction with virtual contours belonging to two disjoint (a) and two overlapping
parts (b), and a $T_1$-junction (c) compared to a $T_2$-junction (d).



Segmentation of Spontaneously Splitting Figures into Overlapping Layers




Fig. 4. Results of the part decompositions and depth assignments (see text). Note that, 
in order to show the depth assignments, the last column (d) depicts a projection of 
3-D data with a rotation of approx. 30° about the depth-axis and different gray levels 
corresponding to the depth values. 



As the computations applied up to now are local operations, we have intro- 
duced a traversal algorithm, called walking, to collect this local information into 
a globally consistent shape description. It operates on the adjacency graph of the 
junctions augmented by their types and an assignment of all potential polarity 
vectors for each direction meeting at the junctions. $L_1$- and $L_2$-junctions initiate
as seeds outgoing “walks” into an unvisited direction by sending out a cursor,
called walker, which holds information about the current walking direction and
region polarity. At the adjacent junction, the incoming direction and polarity are
transformed into an outgoing walker based on a set of predefined rules for each
junction type. These rules are designed to give a continuation which preserves
direction and polarity (details in [12]). If a straight continuation cannot be found,
a change in direction together with an adapted polarity vector is inferred from 
the junction type and entries of already visited contour parts. Cyclic walks are 
considered as successful part decompositions. 

The unit formation process described so far explicitly does not rely on depth 
information. This independence is in agreement with psychophysical findings. 
We rather propose to infer depth arrangements after the unit formation by ex- 
amining the length of the virtual contours. It is known from experiments on 
spontaneously splitting figures that large parts are perceived to lie closer to the 
viewer than small parts [7]. Therefore, we inspect the junctions at overlapping
shapes and assume parts with short virtual contours - indicated by higher curve
saliencies - to lie on top of parts with longer virtual contours. At this point, 
we can classify the unifiedly named virtual contours as modal or amodal com- 
pletions, respectively (Fig. 1). Note that an earlier decision, prior to the depth 
assignments, would have been misleading as it is not required for the unit for- 
mation which remains unchanged in case of bistable depth orderings. 
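
The depth rule described above can be illustrated with a minimal sketch. It assumes that each part has already been formed and is summarised by the total length of its virtual contours; the function name and the example values are hypothetical and only serve to show the ordering.

```python
def order_parts_by_depth(virtual_contour_lengths):
    """Return part names ordered from nearest (on top) to farthest.

    Assumption: every part is described by the total length of its virtual
    contours, and shorter virtual contours mean the part lies on top.
    """
    return sorted(virtual_contour_lengths, key=virtual_contour_lengths.get)


# Hypothetical example: two overlapping parts of a splitting figure.
lengths = {"part_A": 12.5, "part_B": 40.0}
print(order_parts_by_depth(lengths))
```

Because the ordering is computed after the unit formation, swapping the two lengths changes only the depth order and leaves the parts themselves untouched.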



4 Results and Conclusion 

Figure 4 illustrates the results of this processing on the two examples of
Fig. 1b,c. The first column depicts the extracted region boundaries and
L0-junctions (marked by pairs of arrows). The second column shows the curve
saliency maps resulting from the junction voting, where higher saliency values
are represented by darker pixels. The third column contains the different
junction types overlaid with real and virtual contours. The results of the part
decomposition and the depth ordering can be found in the last column.

Currently, we are working on extensions of the framework for the realization 
of bistable depth orderings and the handling of cyclically overlapping shapes. 
This includes further research to decide whether depth assignments should be 
global and constant over parts or whether object units can be perceived to stretch 
across different depths.

Abstract. In this paper an on-line handwriting recognition system
with focus on adaptation techniques is described. Our Hidden Markov 
Model (HMM) -based recognition system for cursive German script 
can be adapted to the writing style of a new writer using either a 
retraining depending on the maximum likelihood (ML)-approach or an 
adaptation according to the maximum a posteriori (MAP)-criterion. 
The performance of the resulting writer-dependent system increases 
significantly, even if only a few words are available for adaptation. 
So this approach is also applicable for on-line systems in hand-held 
computers such as PDAs. This paper deals with the performance 
comparison of two different adaptation techniques either in a supervised 
or an unsupervised mode with the availability of different amounts 
of adaptation data ranging from only 6 words up to 100 words per writer. 

Keywords: online cursive handwriting recognition, writer independent, 
adaptation of HMMs 



1 Introduction 



During the last years, the significance of on-line handwriting recognition systems,
e.g. pen based computers or electronic address books, has increased (compare
[?]). Regardless of the considered script type (i.e. cursive script, printed
or mixed style), Hidden Markov Models (HMMs, see [?]) have been established
because of several important advantages like segmentation-free recognition and
automatic training capabilities.

Although the performance of recognition systems increases, the error rate of 
writer independent recognizers for unconstrained handwriting is still too high 
- at least for some writer types - for a real application. This problem leads to 
the implementation of adaptation techniques, which are well known in speech 
recognition [?]. Apart from some exceptions [12], most of the adaptation
techniques are still unestablished in the area of cursive (on-line) handwriting 
recognition. 

An important aspect for the practical use of adaptation methods is the
amount of adaptation data needed to reduce the error rate significantly.
It is often impossible, and not very user-friendly, to collect several hundreds
of words from a specific writer, which would be necessary to train a basic
recognition system. Thus, the objective should be to conduct a parameter adaptation
starting from a general writer independent system by using only a few words.
Another aspect is the availability of labeled adaptation data. It would be very
useful if a recognition system could be adapted automatically and unsupervised
to a new writing style with unlabeled data, while the system is being used in
test mode.

In the following sections our baseline recognition system and some investigated
adaptation techniques are described (compare also [?]). Results are
given using the maximum likelihood (ML) retraining and the maximum a pos- 
teriori (MAP) adaptation comparing the performance of an unsupervised and 
supervised adaptation. 



2 Baseline Recognition System 

Our handwriting recognition system consists of about 90 different linear HMMs,
one for each character (upper- and lower-case letters, numbers and special
characters). In general the continuous density HMMs consist of 12 states
(with up to 11 Gaussian mixtures per state) for characters and numbers and
fewer states for some special characters depending on their width. To train the
HMMs we use the Baum-Welch algorithm. For recognition, a Viterbi algorithm
finds the character sequence that is most probable given the observed features.
The presented recognition results
refer to a word error rate using a dictionary of about 2200 German words.
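
As an illustration of the decoding step, the following is a minimal Viterbi sketch in the log domain. It is not the recognizer described here: it assumes discrete emissions instead of Gaussian mixtures, a non-empty observation sequence, and a hypothetical two-state linear (left-to-right) model whose probabilities are chosen only for the example.

```python
import math


def log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable state path for a non-empty observation sequence.

    log_start[s], log_trans[r][s] and log_emit[s][o] are log-probabilities.
    Returns the state path and its log-probability.
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[s][obs[0]] for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[s][o])
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(back):  # follow the back-pointers to the start
        state = pointers[state]
        path.append(state)
    return path[::-1], best_score


# Hypothetical two-state linear HMM: state 0 may stay or move on to state 1.
start = [log(1.0), log(0.0)]
trans = [[log(0.5), log(0.5)], [log(0.0), log(1.0)]]
emit = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
path, score = viterbi([0, 0, 1, 1], start, trans, emit)
print(path, math.exp(score))
```

In a word recognizer the same recursion runs over the concatenated character models allowed by the dictionary; the sketch only shows the dynamic programming core.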

2.1 Database 

For our experiments we use a large on-line handwriting database of several writ- 
ers, which is described in the following. The database consists of cursive script 
samples of 166 different writers, all writing several words or sentences on a digi- 
tizing surface. The training of the writer independent system is performed using 
about 24400 words of 145 writers. Testing is carried out with 4153 words of 21 
different writers (about 200 words per writer). 

For each of these 21 test writers up to 100 words are available for adaptation
and nearly 100 words are used for evaluation of the developed writer dependent
system. Some examples of the resampled test-set are shown in Fig. 1.

2.2 Feature Extraction 

After the resampling of the pen trajectory in order to compensate for different
writing speeds, the script samples are normalized. Normalization of the input data
implies the correction of slant and height. Slant normalization is performed by
shearing the pen trajectory according to an entropy criterion. For height
normalization the upper and lower baselines are estimated to detect the core height
of the script sample, which is independent of ascenders and descenders. These
baselines are approximated by two straight lines determined by the local minima
and maxima of the word, respectively.

Fig. 1. Some examples of the handwritten database (7 different writers)

In our on-line handwriting recognition system the following features are derived
from the trajectory of the pen input (compare also [11]):

- the angle of the spatially resampled strokes (sin α, cos α)
- the difference angles (sin Δα, cos Δα)
- the pen pressure (binary)
- a sub-sampled bitmap slid along the pen trajectory (9-dimensional vector),
  containing the current image information in a 30 x 30 window

These 14-dimensional feature vectors x are presented to the HMMs (con- 
tinuous modeling technique) in one single stream for training and testing the 
recognition system. 
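
The four angle-based components of the feature vector can be sketched as follows. This is a simplified illustration, assuming that α is the direction of the segment between two consecutive resampled points and that Δα is the change of direction between consecutive segments (set to 0 for the first segment); the pen pressure and the bitmap components are omitted, and the function name is hypothetical.

```python
import math


def direction_features(points):
    """Return (sin a, cos a, sin da, cos da) for each segment of a trajectory.

    points: list of (x, y) positions of the resampled pen trajectory.
    """
    angles = [math.atan2(y1 - y0, x1 - x0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    features = []
    prev = angles[0] if angles else 0.0
    for a in angles:
        da = a - prev  # difference angle to the previous segment
        features.append((math.sin(a), math.cos(a), math.sin(da), math.cos(da)))
        prev = a
    return features


# A stroke that first moves to the right and then turns by 90 degrees.
for row in direction_features([(0, 0), (1, 0), (1, 1)]):
    print([round(v, 3) for v in row])
```

Encoding each angle by its sine and cosine avoids the discontinuity at ±180° that the raw angle would introduce into a continuous density model.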

3 Adaptation Techniques 

For handwriting adaptation we compare two different adaptation approaches,
which are well known in speech recognition: a retraining according to the maximum
likelihood (ML) criterion using the Baum-Welch algorithm (EM: expectation
maximization, compare [10]) and the maximum a posteriori (MAP, see [?])
adaptation.

The goal is to maximize the matching between the Hidden Markov Models
trained with the general database of 145 writers and a certain writer (wd: writer
dependent) by considering different writing styles. When the amount of training
or adaptation data is not sufficient for a robust training of all HMM parameters
λ, it is useful to reestimate only the means μ and/or variances σ of the Gaussians
of the general HMM (wi: writer independent) and to leave the transitions and
weights unchanged.

To compare the ML-training (Eq. 1) with the MAP algorithm, the objective
functions are given in principle in the following.

\[ \lambda_{ML} = \arg\max_{\lambda} P(X|\lambda) \tag{1} \]

The MAP (or Bayesian) approach (Eq. 2) takes the prior probability P(λ),
which is estimated by the training of the writer independent model, into account.
Here, a separate transformation for each Gaussian is performed. The objective
function leads to an interpolation between the original mean-vector (per state)
and the estimated mean for the special writer depending on the prior probability.
The function depends on the occupation likelihood of the adaptation data.
A similar transform can also be estimated for the variances σ.

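\[ \lambda_{MAP} = \arg\max_{\lambda} P(\lambda|X) \approx \arg\max_{\lambda} P(X|\lambda) P(\lambda) \]

\[ \Rightarrow \mu_{MAP} = f \cdot \mu_{wd} + (1 - f) \cdot \mu_{wi} \tag{2} \]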

The results in the following section have been achieved in a supervised mode 
and also in an unsupervised mode using the same parameters. 
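
The interpolation of the means in the MAP approach can be sketched for a single Gaussian as follows. The weighting N / (N + τ), with N the summed occupation likelihood and τ a prior weight, is a common choice in MAP adaptation and is an assumption here, as are the function name and the value of τ; the sketch is not meant to reproduce the exact weighting used in our system.

```python
def map_adapt_mean(mu_wi, frames, occupation, tau=10.0):
    """Interpolate a writer independent mean towards the adaptation data.

    mu_wi:      mean vector of the writer independent Gaussian
    frames:     feature vectors of the adaptation data
    occupation: occupation likelihood of this Gaussian for each frame
    tau:        assumed prior weight (hypothetical value)
    Returns the adapted mean and the interpolation weight.
    """
    total = sum(occupation)
    if total == 0.0:
        return list(mu_wi), 0.0  # no data seen: keep the prior mean
    dim = len(mu_wi)
    mu_wd = [sum(g * x[d] for g, x in zip(occupation, frames)) / total
             for d in range(dim)]
    weight = total / (total + tau)
    mu_map = [weight * wd + (1.0 - weight) * wi for wd, wi in zip(mu_wd, mu_wi)]
    return mu_map, weight


mu, w = map_adapt_mean([0.0, 0.0], [[1.0, 2.0], [3.0, 2.0]], [1.0, 1.0], tau=2.0)
print(mu, w)
```

With little adaptation data the weight stays small and the adapted mean remains close to the writer independent mean, which is consistent with the robustness of the MAP adaptation when only a few words are available.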

4 Experiments 

In the presented experiments we examine the influence of the adaptation tech- 
nique and the amount of adaptation data, ranging from 6 to 100 words. 

Using the writer independent recognition system (baseline system) a word 
recognition rate of 86.0% is achieved testing the entire test-set of 4153 words. 
This test-set is halved for adaptation experiments. So, using the same test-set 
as for the writer dependent systems, a word recognition rate of 86.3% can be 
obtained on the half test-set consisting of 2071 words (compare Tab. 1, wi). This
recognition accuracy results from the fact that the individual accuracies of the 
writer-independent baseline system for each writer vary from 64.1% to 94.2%. 

Tab. 1 presents the dependency of the recognition error on the amount of
adaptation data and techniques. Again, all experimental results refer to an
average recognition rate of 21 different test writers.

The recognition performance increases significantly, as expected, when the
baseline system is adapted to a new writing style. The error reduction using
unlabeled adaptation data is slightly smaller than in the supervised mode
(Tab. 1, last 2 columns).

Using a relatively large amount of adaptation data (about 100 words), the error
is reduced by about 38% relative when performing the ML-algorithm (Eq. 1), so that
a recognition rate of 91.6% is achieved by reestimating the means only (Tab. 1 f)
in a supervised mode. It can be shown that the adaptation of the variances is
less relevant for recognition performance compared to means-adaptation (Tab. 1 g).
Using the same database of 100 words for retraining, the accuracy of the
baseline system increases to 90.0% by transforming only the variances. Adapting
both parameters, means and variances, using the ML-algorithm, the recognition
rate decreases to 84.8% (Tab. 1 h), indicating that the adaptation dataset is too
sparse (100 adaptation words) to adapt both parameters simultaneously.
So further tests only refer to means-adaptation. A second ML-retraining on the
adaptation data leads to a further error reduction (Tab. 1 i).






A. Brakensiek, A. Kosmala, and G. Rigoll 



Table 1. Word recognition results (%) for 21 writers comparing different adaptation
techniques in a supervised and unsupervised mode

baseline system (wi): 86.3

adaptation technique                                supervised   unsupervised
a) MAP: mean, 6 words for adaptation (set 1)        87.5         87.4
b) MAP: mean, 6 words for adaptation (set 2)        87.1         87.1
c) MAP: mean, 100 words for adaptation              90.8         90.0
d) ML: mean, 6 words for adaptation (set 1)         86.2         85.8
e) ML: mean, 6 words for adaptation (set 2)         86.0         85.6
f) ML: mean, 100 words for adaptation               91.6         90.4
g) ML: variance, 100 words for adaptation           90.0         88.9
h) ML: mean + variance, 100 words for adaptation    84.8         82.7
i) 2xML: mean, 100 words for adaptation             92.0         90.8


Even if only about 6 words are available for adaptation, the error rate can be
reduced. To evaluate the recognition results obtained by a 6-word adaptation, we
repeat all these tests with another disjoint adaptation set of (randomly chosen)
6 words (compare Tab. 1 a-b and d-e). Here, the best recognition performance
is obtained by using the MAP-adaptation technique (Eq. 2). The recognition
accuracy increases to 87.5% (Tab. 1 a; relative error reduction of about 9%) in
the supervised mode. Depending on the amount of adaptation data, the
MAP-adaptation outperforms a standard ML-retraining when only very few words are
available. An ML-reestimation of the means by using only 6 words results in an
increase of errors (Tab. 1 d,e).
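
The relative error reductions quoted in the text follow directly from the recognition rates in Table 1. The short sketch below recomputes them; the function name is ours and only the rates themselves are taken from the table.

```python
def relative_error_reduction(baseline_rate, adapted_rate):
    """Relative reduction of the word error rate in percent.

    Both arguments are word recognition rates in percent.
    """
    baseline_error = 100.0 - baseline_rate
    adapted_error = 100.0 - adapted_rate
    return 100.0 * (baseline_error - adapted_error) / baseline_error


# Baseline 86.3%; rows f), a) and i) of Table 1.
print(round(relative_error_reduction(86.3, 91.6), 1))  # ML, 100 words, supervised
print(round(relative_error_reduction(86.3, 87.5), 1))  # MAP, 6 words, supervised
print(round(relative_error_reduction(86.3, 90.8), 1))  # 2xML, 100 words, unsupervised
```

A negative value indicates an increase of errors, as for the ML-reestimation with only 6 words (rows d and e).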

Although the error is reduced significantly by adapting a writer-independent
system, the recognition accuracy of a real writer-dependent system, which is
trained by about 2000 words of one single writer, is much higher. In this case
a recognition rate of about 95% on average (4 writers) is obtained using a 30k
dictionary (compare also [?]).

A further question is the use of confidence measures to improve the quality of
the adaptation data in the unsupervised mode. One possibility to estimate the
confidence score is the evaluation of an n-best recognition. Another interesting
aspect is the influence of normalization compared to adaptation on the recognition
accuracy. These topics will be evaluated in future work.

5 Summary 

In this paper we presented the comparison of two different adaptation techniques,
based on an HMM on-line handwriting recognition system for German
cursive script samples. We investigated the performance of an ML-retraining (using
the EM-method), optimizing means and/or variances, and a MAP-adaptation
technique for a database of 21 writers in a supervised mode as well as in an
unsupervised mode.

It has been shown that the recognition error can be reduced by up to 9% relative
using the MAP-adaptation with only 6 words as adaptation set. Performing
adaptation with a larger database of 100 words, the accuracy increases
from 86.3% to 92.0%, using a simple ML-reestimation of the means only. Also
the unsupervised ML-adaptation leads to a significant error reduction of about
33% relative.

Abstract. We propose a novel approach to blood vessel detection in
retinal images using shape-based multi-threshold probing. On an image 
set with hand-labeled ground truth our algorithm quantitatively demon- 
strates superior performance over the basic thresholding and another 
method recently reported in the literature. The core of our algorithm, 
classification-based multi-threshold probing, represents a general frame- 
work of segmentation that has not been explored in the literature thus 
far. We expect that the framework may be applied to a variety of other 
tasks. 



1 Introduction 

The retina of an eye is an essential part of the central visual pathways that enable
humans to visualize the real world. Retinal images tell us about retinal, ophthalmic,
and even systemic diseases. The ophthalmologist uses retinal images
to aid in diagnoses, to make measurements, and to look for change in lesions or
severity of diseases. In particular, blood vessel appearance is an important indicator
for many diagnoses, including diabetes, hypertension, and arteriosclerosis.
An accurate detection of blood vessels provides the foundation for the measurement
of a variety of features that can then be applied to tasks like diagnosis,
treatment evaluation, and clinical study. In addition the detected vessel network
supports the localization of the optic nerve [3] and the multimodal/temporal
registration of ocular fundus images [?].

In this work we propose a new algorithm for segmenting blood vessels in a 
retinal image. We systematically probe a series of thresholds on the retinal image. 
Each threshold results in a binary image in which part of the vessel network is 
visible. We detect these vessels by a shape-based classification method. Through 
the combination of the vessels detected from all thresholds we obtain an overall 
vessel network. 

The remainder of the paper begins with a brief discussion of related work.
Then, we present our algorithm in Section 3. The results of an experimental
evaluation follow in Section 4. Finally, some discussions conclude the paper.

* The work was supported by the Stiftung OPOS Zugunsten von Wahrnehmungsbehinderten,
St. Gallen, Switzerland.

Fig. 1. A retinal image (left) and thresholding results from two thresholds 60 (middle)
and 120 (right), respectively.



2 Related Work 

Various approaches have been proposed for blood vessel detection. In [?] the
vessel network is detected by a sequence of morphological operations. A two-stage
approach is proposed in [?]. A watershed algorithm is applied to produce
a highly oversegmented image on which an iterative merge step is performed.

Filter-based methods assume some vessel model and match the model against
the surrounding window of each pixel [?]. As pointed out in [?], however, a
simple thresholding of the filter responses generally cannot give satisfactory results.
Instead, more sophisticated processing is needed. This is done, for instance,
by a multi-threshold region growing in [?].

A number of tracking-based approaches are known from the literature [?].
Using start points, either found automatically or specified by the user,
these methods utilize some profile model to incrementally step along the vessels.
Here it is crucial to start with reliable initial points.

The recent work [?] is of particular interest here. The authors evaluated
their vessel detection method on a retinal image set with hand-labeled vessel
segmentation. They also made the image material and the ground truth data
publicly available. We will use these images to quantitatively evaluate and
compare the performance of our algorithm.



3 Algorithm 



From the methodology point of view our algorithm is very different from the
previous ones discussed in the last section. We base the blood vessel detection on
thresholding operations. Thresholding is an important segmentation technique
and a vast number of approaches are known from the literature; see [?] for a
survey and comparative performance studies. The nature of retinal images (both
nonuniform intensities of vessels and background), however, does not allow a
direct use of traditional thresholding techniques. This is illustrated in Figure 1,
where two different thresholds make different parts of the vessel network visible¹.
The fundamental observation here is that a particular part of the vessel network
can be well marked by some appropriate threshold which is, however, not known
in advance. This motivates us to probe a series of thresholds and to detect vessels
in the binary image resulting from the respective threshold. These partial results
are then combined to yield an overall vessel network.

Fig. 2. Quantities for testing curvilinear bands.

Thresholding with a particular threshold T leads to a binary image B_T, in
which a pixel with intensity lower than T is marked as potential vessel point,
and all other pixels as background. Then, vessels are considered to be curvilinear
structures in B_T, i.e. lines or curves with some limited width. Our approach
to vessel detection in a binary image consists of three steps: a) perform a
Euclidean distance transform on B_T to obtain a distance map; b) prune the
vessel candidates by means of the distance map to only retain center line pixels
of curvilinear bands; c) reconstruct the curvilinear bands from their center line
pixels. The reconstructed curvilinear bands give that part of the vessel network
that is made visible by the particular threshold T. In the following the three
steps are described in some detail.

We apply the fast algorithm for Euclidean distance transform from [?]. For
each candidate vessel point the resulting distance map contains the distance to 
its nearest background pixel and the position of that background pixel. 
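
The thresholding step and the content of the distance map can be illustrated with a small sketch. It deliberately uses a brute-force search over all background pixels, not the fast distance transform referred to above, and is therefore only suitable for tiny examples; the function names and the test image are hypothetical.

```python
import math


def threshold_image(image, t):
    """Binary image: True marks a potential vessel point (intensity < t)."""
    return [[value < t for value in row] for row in image]


def distance_map(binary):
    """Brute-force Euclidean distance map.

    Maps every candidate pixel (row, col) to a pair (distance, position) of
    its nearest background pixel. The cost is quadratic in the image size.
    """
    background = [(r, c) for r, row in enumerate(binary)
                  for c, is_vessel in enumerate(row) if not is_vessel]
    result = {}
    for r, row in enumerate(binary):
        for c, is_vessel in enumerate(row):
            if is_vessel and background:
                nearest = min(background,
                              key=lambda b: (b[0] - r) ** 2 + (b[1] - c) ** 2)
                result[(r, c)] = (math.hypot(nearest[0] - r, nearest[1] - c),
                                  nearest)
    return result


# A dark vertical band of width 3 on a bright background.
image = [[200, 50, 40, 50, 200] for _ in range(3)]
dmap = distance_map(threshold_image(image, 100))
print(dmap[(1, 2)], dmap[(1, 1)])
```

Storing the position of the nearest background pixel in addition to the distance is what makes the subsequent pruning measures computable from the map alone.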

The pruning operation is the most crucial step in our approach. We use two
measures φ and d introduced in [?] to quantify the likelihood of a vessel candidate
being a center line pixel of a curvilinear band of limited width. The meaning of
φ and d can be easily understood in Figure 2, where p and n represent a vessel
candidate and one of the eight neighbors from its neighborhood N_p, respectively,
and e_p and e_n are their corresponding nearest background pixels. The two measures
are defined by:

\[ \phi = \max_{n \in N_p} \mathrm{angle}(\overrightarrow{p e_p}, \overrightarrow{p e_n}) = \max_{n \in N_p} \frac{180}{\pi} \arccos \frac{\overrightarrow{p e_p} \cdot \overrightarrow{p e_n}}{\|\overrightarrow{p e_p}\| \cdot \|\overrightarrow{p e_n}\|} \]

\[ d = \max_{n \in N_p} \|\overrightarrow{e_p e_n}\| \]

The center line pixels of curvilinear bands typically have high values of φ. Therefore,
our first pruning operation is based on a limit P_φ on φ. Since the blood
vessels only have a limited width which is approximated by d, we also put a limit
P_d on d to only retain thin curvilinear bands.

¹ Note that retinal images are usually captured in full color. Using an RGB color model,
the blue band is often empty and the red band is often saturated. Therefore, we only
work with the green band of retinal images.

Applying the two pruning operations above sometimes produces thin curvilinear
bands that show only very weak contrast to the surrounding area and are
not vessels. For this reason we introduce a third, photometric test by requiring a
sufficient contrast. The contrast measure is defined by:

\[ c = \max_{n \in N_p} \frac{\mathrm{intensity}(q_n)}{\mathrm{intensity}(p)} \]

where q_n (with $\overrightarrow{p q_n} = P_{fac} \cdot \overrightarrow{p e_n}$, $P_{fac} > 1$) represents a pixel of the background in
the direction $\overrightarrow{p e_n}$. The corresponding pruning operation is based on a limit P_int
applied to c.
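
The two shape measures can be sketched as follows for a single candidate pixel. The sketch assumes that the nearest background pixels of the candidate and of its eight neighbors are already known from the distance map, that φ is the largest angle between the vectors from p to e_p and from p to e_n, and that a candidate is kept if φ is at least P_φ and d is at most P_d; the direction of the two comparisons and the function names are our assumptions, and the contrast test is omitted for brevity.

```python
import math


def angle_deg(u, v):
    """Angle between two 2-D vectors in degrees (0 if one has zero length)."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0.0:
        return 0.0
    cosine = max(-1.0, min(1.0, (u[0] * v[0] + u[1] * v[1]) / norm))
    return math.degrees(math.acos(cosine))


def shape_measures(p, e_p, neighbour_nearest):
    """Return (phi, d) for a vessel candidate p.

    e_p:               nearest background pixel of p
    neighbour_nearest: nearest background pixels e_n of the neighbors of p
    """
    pe_p = (e_p[0] - p[0], e_p[1] - p[1])
    phi = max(angle_deg(pe_p, (e[0] - p[0], e[1] - p[1]))
              for e in neighbour_nearest)
    d = max(math.hypot(e[0] - e_p[0], e[1] - e_p[1]) for e in neighbour_nearest)
    return phi, d


def is_center_line(phi, d, p_phi=150.0, p_d=6.0):
    """Assumed pruning rule: wide opening angle and limited band width."""
    return phi >= p_phi and d <= p_d


# Center pixel of a vertical band of width 3 (background in columns 0 and 4).
e_n = [(0, 0), (0, 0), (0, 4), (1, 0), (1, 4), (2, 0), (2, 0), (2, 4)]
phi, d = shape_measures((1, 2), (1, 0), e_n)
print(phi, round(d, 3), is_center_line(phi, d))
```

For a pixel on the center line, some neighbor has its nearest background pixel on the opposite side of the band, so φ approaches 180° while d stays close to the band width; pixels inside large dark regions fail the limit on d.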

After the three pruning operations we obtain thin curvilinear bands, implic- 
itly defined by their respective center line pixels. Sometimes, however, a few 
very short structures which are unlikely to be blood vessels are among the de- 
tected bands. Therefore, we put an additional size constraint in the following 
manner. A component labeling is performed on the image containing the center 
line pixels. Any component smaller than a limit P_size is excluded from further
consideration. 

In summary, we defined four pruning operations in total with respect to the
properties angle (φ), width (d), contrast (c), and size. The remaining center line
pixels in the final result, together with their respective distance values, can then
be used to reconstruct the blood vessels that are made visible by the particular
threshold T.

The threshold probing is done for a series of thresholds. A straightforward
way is to step through the intensity domain defined by the lowest and the highest
intensity of a retinal image at a fixed step T_step. In our experimental evaluation
described in the next section T_step is set to 2. The results from the different
thresholds can be simply merged to form an overall vessel image.
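
The probing loop itself can be sketched independently of the classification applied to each binary image. In the sketch below, the detection step is passed in as a callable; the simple detector used in the example is a hypothetical stand-in for the pruning and reconstruction steps and was chosen only to make the example runnable. Integer intensities are assumed.

```python
def probe_thresholds(image, detect, t_step=2):
    """Merge the detections obtained for every probed threshold.

    image:  2-D list of integer intensities
    detect: callable that takes a binary image and returns a set of
            (row, col) vessel pixels (placeholder for the actual detection)
    """
    low = min(min(row) for row in image)
    high = max(max(row) for row in image)
    merged = set()
    for t in range(low, high + 1, t_step):
        binary = [[value < t for value in row] for row in image]
        merged |= detect(binary)  # union of the partial results
    return merged


def keep_small_foreground(binary):
    """Hypothetical detector: accept candidates covering under half the image."""
    pixels = {(r, c) for r, row in enumerate(binary)
              for c, flag in enumerate(row) if flag}
    total = sum(len(row) for row in binary)
    return pixels if 2 * len(pixels) < total else set()


image = [[10, 200, 200], [14, 200, 200]]
print(sorted(probe_thresholds(image, keep_small_foreground)))
```

Since the loop only requires a function from binary images to pixel sets, a different shape-based or photometric classification can be substituted without changing the probing scheme.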

4 Experimental Evaluation 

For two reasons we have chosen the set of twenty retinal images used in [?] to
evaluate the performance of our algorithm. First, the retinal images all have
hand-labeled vessel segmentation and therefore allow quantitative evaluation in
an objective manner. Second, the image set contains both normal and abnormal
(pathological) cases. In contrast, most of the referenced methods have only been
demonstrated upon normal vessel appearances, which are easier to discern. The
twenty images were digitized from retinal fundus slides and are of 605 x 700
pixels, 24 bits per pixel (standard RGB). We only use the green band of the
images.

Fig. 3. A retinal image (left), hand-labeling GT (middle), and detected vessels (right).

Given a machine-segmented result image (MS) and its corresponding hand- 
labeled ground truth image (GT), the performance is measured as follows. Any 
pixel which is marked as vessel in both MS and GT is counted as a true positive. 
Any pixel which is marked as vessel in MS but not in GT is counted as a false 
positive. The true positive rate is established by dividing the number of true 
positives by the total number of vessel pixels in GT. The false positive rate 
is computed by dividing the number of false positives by the total number of 
non-vessel pixels in GT.
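
The two rates can be computed as in the following sketch, which represents MS and GT as equally sized 2-D lists of 0/1 values. The function name and the tiny example images are ours, and the sketch assumes that GT contains at least one vessel and one non-vessel pixel.

```python
def rates(ms, gt):
    """True and false positive rate of a segmentation MS against GT."""
    tp = fp = vessel = non_vessel = 0
    for ms_row, gt_row in zip(ms, gt):
        for m, g in zip(ms_row, gt_row):
            if g:
                vessel += 1
                tp += bool(m)  # vessel in both MS and GT
            else:
                non_vessel += 1
                fp += bool(m)  # vessel in MS but not in GT
    return tp / vessel, fp / non_vessel


gt = [[1, 1, 0, 0], [1, 0, 0, 0]]
ms = [[1, 0, 1, 0], [1, 0, 0, 0]]
print(rates(ms, gt))
```

Each parameter setting of the algorithm yields one such pair of rates per image, and averaging the pairs over the image set gives one point of a performance curve.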

Our algorithm has five parameters: P_φ, P_d, P_fac, P_int, P_size. We report results
using eight sets of parameter values: P_φ = 150 - 5k, P_d = 6 + k, P_fac = 2.0,
P_int = 1.07, P_size = 46 - 3k, k = 0, 1, ..., 7. As an example, the detection result
for the retinal image in Figure 1, processed at k = 3, is given in Figure 3. Figure
4 shows the average performance curve over the twenty images for the eight
parameter sets.
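
For reference, the eight parameter sets can be enumerated with a few lines of code; the dictionary keys are merely our spelling of the parameter names.

```python
def parameter_set(k):
    """Parameter values of the k-th set, k = 0, 1, ..., 7."""
    return {
        "P_phi": 150 - 5 * k,
        "P_d": 6 + k,
        "P_fac": 2.0,
        "P_int": 1.07,
        "P_size": 46 - 3 * k,
    }


for k in range(8):
    print(k, parameter_set(k))
```

With increasing k the limits on angle and size are relaxed and wider bands are admitted, so that more pixels are accepted as vessel and both the true and the false positive rate grow along the performance curve.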

For the twenty images there is a second hand-labeling made by a different 
person available. The first hand-labeling, which is used as ground truth in our 
evaluation, took a more conservative view of the vessel boundaries and in the 
identification of small vessels than the second hand-labeling. By considering 
the second hand-labeling as “machine-segmented result images” and comparing 
them to the ground truth, we can establish a target performance level. This level 
is indicated by the isolated circle mark in Figure 4. Our algorithm produces the
same number of false positives at a 77% true positive rate as the second set
of hand-labeled images at a 90% true positive rate. This suggests room for an
improvement of 13 percentage points in the true positive rate over our method.

As reference, the average performance curve for the method based on filter
response analysis [?] is drawn in Figure 4. We observe similar performance up to
approximately 75% true positive rate. For higher true positive rates our approach
shows superior performance w.r.t. the false positive rate. In particular, we are
able to achieve the 90% true positive rate of the second hand-labeling at a false
positive rate which is slightly higher than double that of the second hand-labeling,
but much lower than would be required by the approach from [?].

Fig. 4. Average performance over twenty images; our approach vs. other methods.

Our multi-threshold probing is motivated by the fact that a single thresh- 
old only reveals part of the vessel network and therefore the basic thresholding 
does not work well. This is quantitatively demonstrated by the average performance
curve for the basic thresholding in Figure 4, established for the thresholds
40, 50, 60, ..., 150.

The computation time amounts to about 26 seconds on a Pentium III 600 
PC. 



5 Conclusion 



We have proposed a novel approach to blood vessel detection in retinal images 
using shape-based multi-threshold probing. On an image set with hand-labeled 
ground truth our algorithm has quantitatively demonstrated superior performance
over the basic thresholding and the method reported in [?].

The core of our algorithm, threshold probing, is independent of the particular 
shape-based classification of thresholded images into vessel and non-vessel parts.
To the knowledge of the authors, such a general framework of classification-based 
multi-threshold probing has not been explored in the literature thus far. We 
expect that the framework may be applied to other tasks, such as the detection 
of pigmented skin lesions and the segmentation of microaneurysms in ophthalmic 
fluorescein angiograms. These topics are among our current research interests.

Abstract. This paper introduces a new approach to automatic vehicle
detection in monocular large scale aerial images. The extraction is based 
on a hierarchical model that describes the prominent vehicle features 
on different levels of detail. Besides the object properties, the model 
comprises also contextual knowledge, i.e., relations between a vehicle and 
other objects as, e.g., the pavement beside a vehicle and the sun causing a 
vehicle’s shadow projection. In contrast to most of the related work, our 
approach neither relies on external information like digital maps or site 
models, nor is it limited to very specific vehicle models. Various examples
illustrate the applicability and flexibility of this approach. However, they 
also show the deficiencies which clearly define the next steps of our future 
work. 



1 Introduction 

Motivated from different points of view, automatic vehicle detection in remotely
sensed data has evolved into an important research issue over more than a decade. In
the field of military reconnaissance, for instance, intense research on Automatic
Target Recognition (ATR) and monitoring systems has been conducted (see [?]
for a review), best exemplified by the DARPA RADIUS project during the 90's.
The vehicle types to be detected by ATR systems often have quite specific and
discriminative features; however, these approaches must tackle problems due to
noisy data (e.g., RADAR images) or bad viewing conditions. Considering civil
applications, the growing amount of traffic and the resulting need to
(partially) automate the control of traffic flow emphasize the role of vehicle
detection and tracking as a substantial part of Intelligent Transportation Systems
(ITS; [?]). Mature approaches, e.g., the ones of Nagel and coworkers (cf. [?])
or Sullivan and coworkers (cf. [?]), use vehicle and motion models for the
detection and tracking of road vehicles in image sequences taken from stationary
cameras. While these systems often incorporate site-model information (e.g.,
the lanes of a road intersection), approaches related to the acquisition of
geographic data - including ours - aim to extract or update site-model information
(cf. [?]). Here, the use of vehicle detection schemes is motivated by the fact
that, especially in aerial images taken over urban areas, site-model components
like roads and parking lots are mainly defined by cars driving or parking there.
Hence an important goal of our work is achieving a good overall completeness of
the extraction rather than detecting or discriminating specific vehicle types (as,
e.g., in [?]).

2 Related Work 

As argued in [?], one of the key problems of vehicle extraction is the high
variability of vehicle types, and thus the intrinsically high dimensional search
space. Different attempts have been made to cope with this:

Olson et al. [?] generate 2D templates (the silhouettes of vehicles) by calculating
different views on 3D objects. Vehicles are then detected by matching
the silhouettes to the binary edge map of the image using a modified Hausdorff
measure. In order to increase efficiency, matching is based on a hierarchical
tree-like model that classifies the silhouettes according to their similarity. Ruskone et
al. [12] (and also Quint [?]) exploit the contextual knowledge that - especially in
urban areas - vehicles are often arranged in collinear or parallel groups. Hence
Ruskone et al. [12] allow a high false alarm rate after the initial steps of vehicle
extraction with a neural network classifier. They show that many erroneous
detection results can afterwards be eliminated by iteratively grouping collinear
and parallel vehicles and rejecting all others. However, the results indicate that a
neural network classifier is probably not powerful enough for the task of vehicle
detection when applied to the raw image data.

The systems of Chellappa et al. [2] and Quint j0| integrate external informa- 
tion from site-models and digital maps, respectively, which reduces the search for 
vehicles to certain image regions. Chellappa et al. [2] extend this further by con- 
straining the extraction along the dominant direction of roads and parking lots. 
Another advantage of this method is the fact that many image regions possibly 
containing vehicle-like structures as, e.g., chimneys on building roofs, are ex- 
cluded from the extraction. Thus, less sophisticated vehicle models are sufficient 
for such systems (these approaches use basically simple rectangular masks). A 
crucial point when using external information, however, is the need for data con- 
sistency. For civil applications site-models are rare, often generalized, possibly 
outdated, or just not accurately enough registered. Therefore, such approaches 
are usually less flexible or they have to cope with site-model registration and 
update in addition. 

In order to keep our approach as flexible as possible, we decided to use data 
stemming from one single source instead of integrating other information, i.e., 
we use pan-chromatic high resolution aerial images (about 10 cm resolution on 
the ground) and standard meta-data like calibration and orientation parameters 
as well as the image capture time. 

3 Model and Extraction Strategy 

The discussion above shows that a reliable vehicle detection scheme must be 
based on a sophisticated model that focuses on a few prominent features 
common to most vehicle types. On the other hand, the extraction strategy 
should be able to efficiently manage the varying appearance of vehicles even for 
large images. This leads us to a hierarchical model that describes the appearance 
of vehicles at different levels of detail. Other distinctive features closely 
related to vehicles, though not intrinsic properties, are included by modeling the 
local context of vehicles (for instance, a vehicle’s shadow projection). 



3.1 Model 

The knowledge about vehicles which is represented by our model is summarized 
in Fig. 1. Due to the use of aerial imagery, i.e., almost vertical views, the model 
comprises exclusively features of a vehicle’s upper surface and neglects the fea- 
tures of the sides up to now. 3D information is used for predicting the shadow 
region, though. 

At the lowest level of detail (Level 1), we model a vehicle’s outline as a convex, 
somewhat compact, and almost rectangular 2D region, possibly showing high 
edge density inside. This captures nearly all typical vehicles except buses and 
trucks. At the next level (Level 2), we incorporate basic vehicle sub-structures, 
e.g., the windshields and uniformly colored surfaces (front, top, and rear) that 
form a sequence of connected 2D-rectangles — denoted as rectangle sequences in 
the sequel. This level still describes many types of passenger cars, vans, and small 
trucks since the rectangles may vary in their size and, what is more, an object 
instance is not constrained to consist of all of the modeled sub-structures (ex- 
emplified by two instances at Level 2 in Fig. 1). The last level (Level 3) contains 
3D information and local context of vehicles. Depending on the sub-structures 
of the previous level different simple height profiles are introduced which enrich 
the 2D model with the third dimension. As a vehicle’s local context we include 
the bright and homogeneous pavement around a vehicle and a vehicle’s shadow 
projection. Especially the shadow projection has proven to be a reliable feature, 
since aerial images are only taken under excellent weather conditions — at least 
for civil applications. 

Please note that the context model needs no external information except 
ground control points for geo-coding the images. If all image orientation param- 
eters and the date and time of image acquisition are known (which is indeed 
the standard case for aerial imagery), then it is easy to compute the direction of 
the sun rays and derive therefrom the shadow region projected on the horizontal 
road surface. 
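
The shadow prediction described above can be illustrated with a short calculation. This is a minimal sketch, assuming that the sun's azimuth and elevation have already been derived from the acquisition date, time, and image orientation (that step is not shown) and that the road surface is horizontal. The function name, its parameters, and the azimuth convention are hypothetical and not taken from the paper.

```python
import math


def shadow_offset(height_m, sun_azimuth_deg, sun_elevation_deg):
    """Ground offset (east, north) in metres of the shadow cast by a point
    at the given height above a horizontal road surface.

    Assumptions (not from the paper): the azimuth is measured clockwise
    from north and points towards the sun; the elevation is the angle of
    the sun above the horizon.
    """
    if not 0.0 < sun_elevation_deg <= 90.0:
        raise ValueError("sun must be above the horizon")
    # Horizontal length of the shadow of a vertical segment of this height.
    length = height_m / math.tan(math.radians(sun_elevation_deg))
    az = math.radians(sun_azimuth_deg)
    # The shadow falls on the side opposite to the sun.
    east = -length * math.sin(az)
    north = -length * math.cos(az)
    return east, north
```

Applying this offset to the upper corners of a 3D vehicle hypothesis would give the outline of the expected shadow region on the ground.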

3.2 Extraction Strategy 

The extraction strategy is derived from the vehicle model and, consequently, fol- 
lows the two paradigms “coarse to fine” and “hypothesize and test”. It consists 
of the following four steps: (1) Creation of Regions of Interest (RoIs), which is mainly 
based on edge voting for convex and compact regions, (2) Hypotheses formation 
for deriving rectangle sequences from extracted lines, edges, and surfaces, (3) 
Hypotheses validation and selection, which includes the radiometric and geomet- 
ric analysis of rectangle sequences, and (4) Verification using 3D vehicle models 
and their local context. See also Fig. 2 for an illustration of the individual steps. 

In order to avoid time-consuming grouping algorithms in the early stages 
of extraction, we first focus on generic image features such as edges, lines, and surfaces. 

Creation of RoIs: We start with the extraction of edges, link them 
into contours, and robustly estimate the local direction and curvature at each 
contour point. Similar to a generalized (i.e., elliptical) Hough transform, we use 
direction and curvature as well as intervals for length and width of vehicles for 
deriving the centers and orientations of the RoIs (Fig. 2 a and b). 

Hypotheses formation: One of the most obvious vehicle features in 
aerial images is the line-like dark front windshield. Therefore, we extract dark 
lines of a certain width, length and straightness in the RoIs using the differential 
geometric approach of Steger and, because vehicles are usually uniformly 
colored, we select those lines as windshield candidates which have symmetric 
contrast on both sides (see bold white lines in Fig. 2 c). Front, top, and rear 
of a vehicle mostly appear as a more or less bright “blob” in the vicinity of a 
windshield. Thus, we robustly fit a second-order gray value surface to each image 
part adjacent to windshield candidates (apexes are indicated by white crosses 
in Fig. 2 c) and keep only those blobs whose estimated surface parameters 
satisfy the condition of an ellipsoidal or parabolic surface. As a side effect, both 
algorithms for line and surface extraction return the contours of their bounding 
edges (bold black lines in Fig. 2 c). These contours and additional edges 
extracted in their neighborhood (thin white lines in Fig. 2 c) are approximated 
by line segments. Then, we group the segments into rectangles and, furthermore, 
rectangle sequences. Fitting the rectangle sequences to the dominant direction 
of the underlying image edges yields the final 2D vehicle hypotheses (Fig. 2 d). 

Fig. 2. Individual steps of vehicle detection: a) Extracted edges, b) Voting (bright = RoI), c) Edges, lines, and surfaces, d) Rectangle sequences, e) Validated hypothesis, f) Shadow verification 

Hypotheses validation and selection: Before constructing 3D models 
from the 2D hypotheses we check the hypotheses for validity. Each rectangle 
sequence is evaluated regarding the geometric criteria length, width, and 
length/width ratio, as well as the radiometric criteria homogeneity of an 
individual rectangle and gray value constancy of rectangles connected with a 
“windshield rectangle”. The validation is based on fuzzy-set theory and ensures 
that only promising, non-overlapping hypotheses are selected (Fig. 2 e). 
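
To make the idea of a fuzzy validation of the geometric criteria concrete, the following is a minimal sketch. It assumes trapezoidal membership functions and a fuzzy AND realised as the minimum; the paper does not state which membership functions or combination rule it uses. All numeric bounds and the names `trapezoid` and `hypothesis_score` are hypothetical. The radiometric criteria and the selection of non-overlapping hypotheses are not covered here.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership (assumes a < b <= c < d):
    0 outside ]a, d[, 1 inside [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def hypothesis_score(length_m, width_m):
    """Combine the geometric criteria length, width, and length/width ratio.

    The bounds below are illustrative values for passenger cars and are
    not taken from the paper.
    """
    mu_length = trapezoid(length_m, 2.5, 3.5, 5.5, 7.0)
    mu_width = trapezoid(width_m, 1.2, 1.5, 2.0, 2.5)
    mu_ratio = trapezoid(length_m / width_m, 1.5, 2.0, 3.0, 4.0)
    # Fuzzy AND realised as the minimum of the memberships.
    return min(mu_length, mu_width, mu_ratio)
```

A hypothesis would then be kept if its score exceeds a chosen threshold, with the highest-scoring hypothesis winning among overlapping candidates.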






S. Hinz and A. Baumgartner 



3D model generation and verification: In order to approximate the 
3D shape of a vehicle hypothesis, a particular height profile is selected from a 
set of predefined profiles. Please note that the hypothesis’ underlying rectangles 
remain unchanged, i.e., the height values of the profile refer to the image edges 
perpendicular to the vehicle direction. The selection of the profiles depends on 
the extracted sub-structures, i.e., the shape of the validated rectangle sequence. 
We distinguish rectangle sequences corresponding to 3 types of vehicles: 
hatch-back cars, saloon cars, and other vehicles such as vans, small trucks, etc. 
In contrast to hatch-back and saloon cars, the derivation of an accurate height 
profile for the last category would require a deeper analysis of the hypotheses 
(e.g., for an unambiguous determination of the vehicle orientation). Hence, in 
this case, we approximate the height profile only roughly by an elliptic arc 
having a constant height offset above the ground. 

After creating a 3D model from the 2D hypothesis and the respective height 
profile we are able to predict the boundary of a vehicle’s shadow projection on 
the underlying road surface. A vehicle hypothesis is judged as verified if a dark 
and homogeneous region is extracted beside the shadowed part of the vehicle 
and a bright and homogeneous region beside the illuminated part, respectively 
(Fig. 2 f). 



4 Results and Discussion 



Figure 3 shows parts of an aerial image with a ground resolution of 9 cm. The 
white rectangle sequences indicate the extracted cars and their sub-structures. In 
the image part shown in Fig. 3 a), all cars have been extracted successfully. Due 
to the weak contrast, however, the shape of the two cars in the center has not 
been recovered exactly. For instance, the sub-structure corresponding to the front 
windshield of the upper car is missing. Despite this, shape and underlying edge 
support of the rectangle as well as a successful shadow verification gave enough 
evidence to judge this particular hypothesis correctly. 

Although three vehicles are successfully detected, the result of the image part 
shown in Fig. 3 b) shows more severe failures. The dark vehicle in the upper left 
part has been missed because its body, windshields, and shadow region show very 
similar gray values. Hence, the system has constructed only one single, over-sized 
rectangle which has been rejected during the hypotheses selection phase. While 
this kind of error is presumably quite easy to overcome by using color images, 
the reason for missing the car in the lower right corner is of more fundamental 
nature and strongly related to the specular characteristics of cars. As can be 
seen from Fig. 0 (Level sG), due to specularities front and top of the car show 
very different average gray values which caused the rejection of this hypothesis. 

Hence, we can conclude that the extraction is quite reliable if (at least) some 
of the modeled sub-structures are easy to identify. However, the system tends to 
miss vehicles due to weak contrast, specularities, and influences of neighboring 
objects. To address these problems, we have to extend our model by rela- 
tions between individual vehicles and vehicle groupings in order to incorporate 
more global knowledge like similar orientations and rather symmetric spacing 
(e.g., a dark “blob” between two previously extracted vehicles is likely to be 
another vehicle). Further improvements seem possible if color imagery and/or 
more sophisticated 3D models are used. Last but not least, we plan to integrate 
this approach into the road extraction system described in |n|. 

Fig. 3. Extraction results on pan-chromatic aerial images (resolution: 9 cm) 

(Footnote: image has been rotated for display.) 

Abstract. This paper deals with the reconstruction of original image 
structure in the presence of local disturbances such as specular reflec- 
tions. It presents two novel schemes for their elimination with respect 
to the local image structure: an efficient linear interpolation scheme and 
an iterative filling-in approach employing anisotropic diffusion. The al- 
gorithms are evaluated on images of the heart surface and are suited to 
support tracking of natural landmarks on the beating heart. 



1 Introduction 



Glossy surfaces give rise to specular reflection from light sources. Without proper 
identification specularities are often mistaken for genuine surface markings by 
computer vision applications such as matching models to objects, deriving mo- 
tion fields from optical flow or estimating depth from binocular stereo IBIHHI- 

This paper presents two approaches to reconstruct the original image struc- 
ture in the presence of local disturbances. The algorithms have been developed 
to enable robust tracking of natural landmarks on the heart surface [GOSHOr] 
as part of the visual servoing component in a minimally invasive robotic surgery 
scenario lORS+nil . There specular reflections of the point light source arise on 
the curved and deforming surface of the beating heart. Due to sudden and ir- 
regular occurrence these highlights disturb tracking of natural landmarks on 
the beating heart considerably [GrbOPj . Reconstruction schemes are sufficiently 
general for application in other fields, where disturbances in images should be 
eliminated ensuring continuity of local structures. 

Previous work mainly investigates specular reflection together with diffuse 
reflection [IRRRRIWolfldlJ, aiming to suppress the specular component while 
enhancing the diffuse one. This work considers local specular reflections with no 
detectable diffuse components, which cause total loss of information. Therefore, 
reconstruction can only be guided by surrounding image structures. 

The following section introduces robust extraction of image structure by the 
structure tensor, on which two schemes for reconstruction are based: linear in- 
terpolation between boundary pixels and anisotropic diffusion within a filling-in 
scheme. The algorithms are evaluated on video images of the heart with specu- 
larities (Sect. 3), before concluding with a summary of results and perspectives. 

2 Reconstruction 

Specularities occur as highlights on the glossy heart surface (Fig. 1). Since their 
grey values are distinctively high and independent of neighbourhood intensities, 
simple thresholding can be applied for segmentation. Structure inside specular 
areas is restored from local structure information determined by the well-known 
structure tensor. This yields the reconstruction which is most likely to correspond 
to the original area, on condition that surface structures possess some continuity. 
Therefore, intensity information mainly from boundary points along the current 
orientation is used. Results are presented for the mechanically stabilized area of 
interest of the beating heart (Fig. 1). 




Fig. 1. Original image with specularities (detail) 



2.1 Structure Detection 

The structure tensor provides a reliable measure of the coherence of structures 
and their orientation, derived from surrounding gradient information. For more 
details see ISHj. 

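Definition 1 (Structure tensor). For an image $f$ with Gaussian smoothed 
gradient $\nabla f_\sigma = \nabla (g_\sigma * f)$ the structure tensor is defined as 

$$J_\rho(\nabla f_\sigma) = g_\rho * (\nabla f_\sigma \otimes \nabla f_\sigma) = g_\rho * \begin{pmatrix} \left(\frac{\partial f_\sigma}{\partial x}\right)^2 & \frac{\partial f_\sigma}{\partial x}\frac{\partial f_\sigma}{\partial y} \\ \frac{\partial f_\sigma}{\partial x}\frac{\partial f_\sigma}{\partial y} & \left(\frac{\partial f_\sigma}{\partial y}\right)^2 \end{pmatrix} \qquad (1)$$

where $g_\rho$ is a Gaussian kernel of standard deviation $\rho > 0$, separately convolved 
with the components of the matrix resulting from the tensor product $\otimes$. 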

The noise scale $\sigma$ reduces image noise before the gradient operator is applied, 
while the integration scale $\rho$ adjusts $J_\rho$ to the size of structures to be detected. 
The eigenvalues $\lambda_{1/2}$ of the structure tensor $J_\rho$ measure the average contrast 
in the direction of the eigenvectors (over some area specified by the integration 
scale $\rho$). Since $J_\rho$ is symmetric, its eigenvectors $v_1, v_2$ are orthonormal [Wei98]. 
The eigenvector $v_1$ corresponding to the eigenvalue with greatest absolute value 
($\lambda_1$) gives the orientation of highest grey value fluctuation. The other eigenvector 
$v_2$, which is orthogonal to $v_1$, gives the preferred local orientation, the coherence 
direction. Moreover, the term $(\lambda_1 - \lambda_2)^2$ is a measure of the local coherence and 
becomes large for anisotropic structures [Wei98]. 

The structure tensor $J_\rho$ is used to extract structure orientation for recon- 
struction. Anisotropic confidence-based filling-in also makes use of the coherence 
measure $(\lambda_1 - \lambda_2)^2$ for structure dependent diffusion (Sect. 2.3). 
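
The eigen-analysis of the structure tensor at a single pixel can be written in closed form, since the tensor is a symmetric 2 x 2 matrix. The following is a minimal sketch, assuming the three smoothed tensor components at that pixel are already available; the function name `analyse_tensor` and its argument names are hypothetical.

```python
import math


def analyse_tensor(j11, j12, j22):
    """Eigen-analysis of a symmetric 2x2 structure tensor [[j11, j12], [j12, j22]].

    Returns (lam1, lam2, coherence_angle, coherence) with lam1 >= lam2,
    coherence_angle the orientation (in radians) of the eigenvector
    belonging to lam2, and coherence = (lam1 - lam2) ** 2.
    """
    mean = 0.5 * (j11 + j22)
    diff = 0.5 * (j11 - j22)
    root = math.hypot(diff, j12)
    lam1, lam2 = mean + root, mean - root
    # Orientation of the eigenvector of lam1 (highest grey value fluctuation).
    theta1 = 0.5 * math.atan2(2.0 * j12, j11 - j22)
    # The coherence direction is orthogonal to it.
    coherence_angle = theta1 + 0.5 * math.pi
    return lam1, lam2, coherence_angle, (lam1 - lam2) ** 2
```

For a purely horizontal gradient (a vertical edge), the sketch returns a coherence direction of 90 degrees, i.e. along the edge, and a coherence of zero for isotropic tensors.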

2.2 Structure Tensor Linear Interpolation 

Specular areas are reconstructed by interpolation between boundary points along 
the main orientation of the local structure, extracted by the structure tensor. 

Extraction of Local Structure. Local orientation is given by the eigenvector $v_2$ 
belonging to the minor eigenvalue $\lambda_2$ of the structure tensor $J_\rho$ (Sect. 2.1), 
which is required for every specular pixel. To detect structure in the given heart 
images, appropriate values for noise and integration scale are $\sigma = 1$ and $\rho = 2.8$, 
corresponding to a catchment area of about 9 × 9 pixels. 

Reconstruction. For each point inside the specular area, search for the two 
boundary points along the structure orientation associated with it. Linear interpo- 
lation between the intensities of the boundary points, weighted according to their 
relative distances, yields the new value at the current position. Final low-pass 
filtering ensures sufficient smoothness in the reconstructed area. 
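
The reconstruction step for a single specular pixel can be sketched as follows. This is a minimal illustration, assuming the image and the binary specular mask are given as nested lists and that the local orientation angle has already been obtained from the structure tensor. The pixel-wise walk with rounding, the handling of rays that leave the image, and the names `find_boundary` and `interpolate_pixel` are assumptions, not the authors' implementation; the final low-pass filtering is omitted.

```python
import math


def find_boundary(mask, y, x, dy, dx):
    """Walk from (y, x) in direction (dy, dx) until leaving the specular mask.

    Returns (row, col, distance) of the first non-specular pixel, or None
    if the image border is reached first.
    """
    h, w = len(mask), len(mask[0])
    step = 1
    while True:
        r = int(round(y + step * dy))
        c = int(round(x + step * dx))
        if not (0 <= r < h and 0 <= c < w):
            return None
        if not mask[r][c]:
            return r, c, math.hypot(r - y, c - x)
        step += 1


def interpolate_pixel(image, mask, y, x, angle):
    """Linear interpolation between the two boundary points along `angle`."""
    dy, dx = math.sin(angle), math.cos(angle)
    a = find_boundary(mask, y, x, dy, dx)
    b = find_boundary(mask, y, x, -dy, -dx)
    if a is None or b is None:
        # Fallback (an assumption): use the single boundary point found, if any.
        hit = a or b
        return image[hit[0]][hit[1]] if hit else image[y][x]
    (ra, ca, da), (rb, cb, db) = a, b
    # The closer boundary point receives the larger weight.
    return (db * image[ra][ca] + da * image[rb][cb]) / (da + db)
```

For a one-row image `[[10, 255, 255, 255, 50]]` whose three middle pixels are specular, the pixel at column 1 is reconstructed as 20 when the orientation is horizontal.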

2.3 Anisotropic Confidence-Based Filling-In 

This reconstruction scheme fills specular areas from the boundary based on local 
structure information. It employs coherence enhancing anisotropic diffusion. 

Coherence Enhancing Anisotropic Diffusion. Diffusion is generally con- 
ceived as a physical process that equilibrates concentration without creating or 
destroying mass. Applied to images, intensity at a certain location is identified 
with concentration. Thus the diffusion process implies smoothing of peaks and 
sharp changes of intensity, where the image gradient is strong. 

The discussed type of diffusion, designed to enhance the coherence of struc- 
tures [Wei98], is anisotropic and inhomogeneous. Inhomogeneous means that 
the strength of diffusion depends on the current image position, e.g. the ab- 
solute value of the gradient $\nabla f$. Further, anisotropic diffusion corresponds to 
non-uniform smoothing, e.g. directed along structures. Thus edges can not only 
be preserved but even enhanced. 

The diffusion tensor $D = D(J_\rho(\nabla f_\sigma))$, needed to specify anisotropy, has the 
same set of eigenvectors $v_1, v_2$ as the structure tensor $J_\rho$ (Sect. 2.1), reflect- 
ing local image structure. Its eigenvalues $\lambda'_1, \lambda'_2$ are chosen to enhance coherent 
structures, which implies a smoothing preference along the coherence direction 
$v_2$ with diffusivity $\lambda'_2$ increasing with respect to the coherence $(\lambda_1 - \lambda_2)^2$ of $J_\rho$: 

$$\lambda'_1 := \alpha, \qquad \lambda'_2 := \begin{cases} \alpha & \text{if } \lambda_1 = \lambda_2, \\ \alpha + (1 - \alpha) \exp\left(\frac{-C}{(\lambda_1 - \lambda_2)^{2m}}\right) & \text{otherwise} \end{cases} \qquad (2)$$

where $C > 0$, $m \in \mathbb{N}$, $\alpha \in \,]0, 1[$, and the exponential function ensures the smooth- 
ness of $D$. For homogeneous regions, $\alpha$ specifies the strength of diffusion. We 
follow [Wei98] with $C = 1$, $m = 1$, and $\alpha = 0.001$. 
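
The choice of the diffusion tensor eigenvalues can be sketched in a few lines. This is a minimal sketch, assuming the standard form of coherence-enhancing diffusivities in which the exponent is the negative of C divided by the m-th power of the coherence; the function name `diffusion_eigenvalues` is hypothetical, and the defaults are the parameter values quoted in the text.

```python
import math


def diffusion_eigenvalues(lam1, lam2, C=1.0, m=1, alpha=0.001):
    """Eigenvalues of the diffusion tensor from the structure tensor
    eigenvalues lam1, lam2 (coherence-enhancing choice).

    Returns (d1, d2): d1 acts across the structure, d2 along the
    coherence direction.
    """
    coherence_m = ((lam1 - lam2) ** 2) ** m
    d1 = alpha
    if coherence_m == 0.0:
        # Isotropic neighbourhood: weak, direction-independent smoothing.
        d2 = alpha
    else:
        d2 = alpha + (1.0 - alpha) * math.exp(-C / coherence_m)
    return d1, d2
```

The diffusivity along the coherence direction stays between alpha and 1 and approaches 1 for strongly oriented structures, while diffusion across the structure remains at alpha.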



Anisotropic Confidence-Based Filling-In. A filling-in scheme for structure 
driven reconstruction is developed, employing coherence enhancing anisotropic 
diffusion. The algorithm is an extension of the confidence-based filling-in model 
of Neumann and Pessoa [IN and is described by the following equation: 

Definition 2 (Anisotropic confidence-based filling-in). For an image $f$, 
evolving over time as $f_t$, anisotropic confidence-based filling-in is given by 

$$\frac{\partial f_t}{\partial t} = (1 - c)\,\mathrm{div}(D(J_\rho(\nabla f_0))\nabla f_t) + c\,(f_0 - f_t) \qquad (3)$$

where $f_0$ is the initial image and $c : \mathrm{dom}(f) \to [0, 1]$ is a confidence measure. 

In the present work, $c$ maps to $\{0, 1\}$, where $c = 0$ refers to unreliable image 
information, i.e. specularities, and non-specular image points provide reliable 
information ($c = 1$). Therefore non-specular regions are not modified, while 
specular points are processed according to 

$$\frac{\partial f_t}{\partial t} = \mathrm{div}(D(J_\rho(\nabla f_0))\nabla f_t). \qquad (4)$$

Algorithm. Linear diffusion equations like (4) are commonly solved by relax- 
ation methods. Intensity from boundary pixels is propagated into the specular 
area, or vice versa, specular intensity is drained from it (source-sink model). 
In the current implementation reconstruction calculates the dynamic changes of 
diffusion over time in an iterative approach. In each step diffusion is represented 
by convolution with space-invariant kernels, shaped according to the diffusion 
tensor as in [Wei98]. The algorithm runs until the relative change of specular 
intensity is sufficiently small. 

As intensities at the boundary are constant, diffusion corresponds to inter- 
polation between boundary points. Filling-In is related to regularization theory 
firm in employing a smoothness and a data-related term. The filling-in model 
discussed here incorporates a linear diffusion scheme. Local structure informa- 
tion for the diffusion process is extracted only from the initial image, which 
allows efficient implementation of the filling-in process. 
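
The statement that diffusion with fixed boundary intensities corresponds to interpolation between boundary points can be checked on a one-dimensional toy problem. The sketch below deliberately swaps the anisotropic diffusion tensor for a constant scalar diffusivity of 1 in 1D and uses a simple explicit time-stepping scheme instead of the kernel-based iteration described above, so it illustrates only the confidence-based filling-in principle. The function name `fill_in_1d`, the step size, and the stopping tolerance are hypothetical.

```python
def fill_in_1d(f0, confidence, tau=0.25, tol=1e-6, max_iter=100000):
    """Explicit iteration of df/dt = (1 - c) * f_xx + c * (f0 - f) in 1D.

    f0: initial signal, confidence: values in [0, 1] per sample
    (0 = specular / unreliable, 1 = reliable). The first and last
    samples are kept fixed.
    """
    f = [float(v) for v in f0]
    for _ in range(max_iter):
        new = list(f)
        change = 0.0
        for i in range(1, len(f) - 1):
            c = confidence[i]
            laplace = f[i - 1] - 2.0 * f[i] + f[i + 1]
            new[i] = f[i] + tau * ((1.0 - c) * laplace + c * (f0[i] - f[i]))
            change = max(change, abs(new[i] - f[i]))
        f = new
        # Stop once the largest update is sufficiently small.
        if change < tol:
            break
    return f
```

For the signal `[10, 255, 255, 255, 50]` with the three middle samples marked as specular, the iteration converges towards the linear ramp 10, 20, 30, 40, 50, while the reliable samples remain unchanged.
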

3 Evaluation 

Reconstruction is required to meet the following criteria: 

(R0) Restriction to specular areas 
(R1) Smooth reconstruction inside specular areas 
(R2) Smooth transition to boundaries 
(R3) Continuity of local structures 
(A1) Realtime operability 

As a basic requirement only specular areas shall be altered (R0). To avoid new 
artefacts, e.g. edges, smoothness is assessed both inside specular areas (R1) 
and at the boundary (R2). Structures next to specularities should be continued 
within reconstructed areas (R3), because uniform filling-in from the boundary 
may cause new artefacts by interrupted structures. Criterion (A1) is kept in mind 
to enable realtime application, as required for tracking in a stream of images. 



3.1 Reconstruction at Specularities on the Heart Surface 

First, quality of reconstruction within structured areas is measured by introduc- 
ing an artificial disturbance on a horizontal edge, i.e. a 5 × 5 pixel square with 
intensity similar to specular reflection (Fig. 2 left). The other two images show 
reconstruction by structure tensor linear interpolation and anisotropic filling-in: 
The artificial specularity vanishes and the original structure is restored. 

Since the underlying structure is known, reconstructed areas of both methods 
can be compared with the original area by the sum of squared differences (SSD), 
a similarity measure also commonly used for tracking: With intensities between 0 
and 255, SSD is 872 for structure tensor interpolation versus 553 for anisotropic 
filling-in, which is slightly superior as intensities are closer to the original image. 
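
For reference, the sum of squared differences used for this comparison can be computed as follows. This is a minimal sketch for two equally sized grey value patches given as nested lists; the function name `ssd` is hypothetical.

```python
def ssd(patch_a, patch_b):
    """Sum of squared differences between two equally sized grey value patches."""
    if len(patch_a) != len(patch_b) or any(
        len(ra) != len(rb) for ra, rb in zip(patch_a, patch_b)
    ):
        raise ValueError("patches must have the same shape")
    return sum(
        (p - q) ** 2
        for row_a, row_b in zip(patch_a, patch_b)
        for p, q in zip(row_a, row_b)
    )
```

A lower value means that the reconstructed patch is closer to the original, with 0 indicating identical patches.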




Fig. 2. Elimination of artificial specularity (left to right: original, dto. with artificial 
specularity (bright square), reconstruction by linear interpolation, and by anisotropic 
filling-in) 



Secondly, reconstruction is considered with real specularities on the heart 
surface for the area of interest in the original image (Fig. 1). Figures 3 and 4 
show reconstruction by linear interpolation and anisotropic filling-in. 

Structure Tensor Linear Interpolation. Results of linear interpolation be- 
tween boundary points are good, based on robust orientation information by 
the structure tensor. Structures are continued (R3), and sufficient smoothness is 
provided at the boundaries (R2) and within reconstructed areas (R1) by final 
low-pass filtering. 




Fig. 3. Image reconstructed by structure tensor interpolation (detail) 



Time Complexity. Let $n$ be the image size. To compute the structure tensor, 
filter operations with the Gaussian derivative to obtain $\nabla f_\sigma$ and applying Gaus- 
sians $g_\rho$ to its components each require $O(n \log n)$, for convolution as complex 
multiplication in Fourier space using FFT. Including $O(n)$ for eigenvector com- 
putation, complexity remains $O(n \log n)$ for the structure tensor. Reconstruction 
by linear interpolation follows: assuming without loss of generality a square image, there 
are at most $n$ specular pixels, for each of which the search for boundary points takes 
$\sqrt{n}$ steps. This gives an overall complexity of $O(n\sqrt{n})$, including the complexities 
for the structure tensor and final smoothing ($O(n \log n)$). Reconstruction of a 
720 × 288 image requires less than two seconds on a 600 MHz Pentium III Linux 
workstation, which is a non-optimised upper bound. 



Anisotropic Confidence-Based Filling-In. The anisotropic filling-in scheme 
preserves structure occluded by specular reflections (R3). Inherent smoothness 
of the diffusion process inside (R1) and at the boundaries of specular reflections 
(R2) can be visually confirmed (Fig. 4). Specularities can be eliminated if the 
iteration scheme is applied long enough, but not all intensity from large specular 
areas may be drained if time is restricted. 

Time Complexity. The existence of a steady state and the number of iterations 
needed depend on the properties of the diffusion equation (4), more precisely on its 
eigenvalues. Each iteration requires convolution with a space-dependent kernel 
for each specular pixel, yielding $O(n \log n)$ for an image of size $n$ bounding the 
number of specular pixels. Additionally, the number of iterations required grows 
with the size of specular areas. Up to 800 iterations are needed to eliminate 
most specularities on the heart surface, which takes more than one minute for a 
720 × 288 image in the current implementation. 

Fig. 4. Image reconstructed by anisotropic filling-in (detail) 

Time performance may be improved by multigrid methods to solve the diffu- 
sion equation IIMAT'ifil . Recent research also indicates speedup by a hierarchical 
filling-in scheme mm or a neural network approach which accelerates 

the diffusion process by learning changes made by diffusion over a number of 
iterations. 



3.2 Discussion 

Both schemes provide good results for reconstruction of specularities, where 
all criteria (R0)-(R3) can be met: The structure tensor interpolation scheme 
is fast but does not ensure smoothness within reconstructed areas, which can 
be compensated easily. Anisotropic filling-in creates inherently smooth areas; it 
applies a diffusion process depending on both orientation and strength of local 
structures. However, computational cost is much higher than for structure 
tensor interpolation, which implies worse reconstruction if time is restricted. 

Structure information obtained by the structure tensor from the original im- 
age is influenced by specularities. Disturbed areas should not be large compared 
to local structures, since otherwise reasonable reconstruction is not possible: 
Structure information is distorted and small details cannot be reconstructed. 



4 Conclusion and Perspectives 

Two novel schemes have been developed to reconstruct original image structure 
in the presence of local disturbances. Anisotropic confidence-based filling-in em- 
ploys smoothness constraints to reach an optimum solution for reconstruction. 
Linear interpolation between boundary points in the coherence direction reaches 
comparable results with considerably lower computational expense, which makes it 
suited for realtime tracking of image structures. 

The algorithms have been successfully applied to reconstruct the heart sur- 
face in the presence of specular reflections. Further, tracking natural landmarks 
on the beating heart can be improved greatly, as outliers caused by specularities 
are compensated ITFCT1 . Apart from tracking the algorithms are sufficiently 
general to reconstruct structured images partially occluded by disturbances. 

Abstract. We demonstrate a method to automatically extract 
spatio-temporal descriptions of human faces from synchronized and 
calibrated multi-view sequences. The head is modeled by a time-varying 
multi-resolution subdivision surface that is fitted to the observed person 
using spatio-temporal multi-view stereo information, as well as contour 
constraints. The stereo data is utilized by computing the normalized 
correlation between corresponding spatio-temporal image trajectories 
of surface patches, while the contour information is determined using 
incremental background subtraction. We globally optimize the shape of 
the spatio-temporal surface in a coarse-to-fine manner using the multi- 
resolution structure of the subdivision mesh. The method presented 
incorporates the available image information in a unified framework and 
automatically reconstructs accurate spatio-temporal representations of 
complex non-rigidly moving objects. 

Keywords: Multi-View Stereo, 3D Motion Flow, Motion Estimation, 
Subdivision Surfaces, Spatio-temporal Image Analysis 



1 Introduction 

What kind of representation is necessary to understand, simulate and copy an 
activity of a person or an animal? We need a spatio-temporal representation of 
the actor in action that captures all the essential shape and motion information 
that an observer can perceive in a view-independent way. This representation 
is made up by the 3D spatial structure as well as the 3D motion field on the 
object surface (also called range flow [E| or scene flow m) describing the first 
order temporal changes of the object surface. Such a representation of the shape 
and motion of an object would be very useful in a large number of applications. 
Even for skilled artists it is very hard to animate computer graphics models 
of living non-rigid objects in a natural way, thus a technique that enables one 
to recover the accurate 3D displacements would add another dimension to the 
realism of the animation. Accurate 3D motion estimation is also of great interest 
to the biomedical field where the motion of athletes or patients is analyzed. The 
recovery of dense 3D motion fields will also help us to understand the human 



B. Radig and S. Florczyk (Eds.): DAGM 2001, LNCS 2191, pp. 61-[^^ 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 






J. Neumann and Y. Aloimonos 



representation of action by comparing the information content of 3D and 2D 
motion fields for object recognition and task understanding. 

We present a new algorithm that computes accurately the 3D structure and 
motion fields of an object solely from image data utilizing silhouette and spatio- 
temporal image information without any manual intervention. There are many 
methods in the literature that recover three-dimensional structure information 
from image sequences captured by multiple cameras. Some example techniques 
are based on voxel carving 1 1 2f \ t)j or multi-base line stereo j1 St3) . sometimes 
followed by a second stage to compute the 3D flow m from optical flow in- 
formation. Unfortunately, these approaches usually do not try to compute the 
motion and structure of the object simultaneously, thereby not exploiting the 
synergistic relationship between structure and motion constraints. Approaches 
that recover the motion and structure of an object simultaneously, most often 
depend on the availability of prior information such as an articulated human 
model [?] or an animation mask for the face [?]. Some of the few examples
of combined estimation without domain-restricted model assumptions are [?].
But in contrast to our approach, the scene is still parameterized with respect
to a base view, whereas we use a view-independent object space parameterization.
Recently, [?] used a motion and voxel carving approach to recover both
motion and structure. They use a fixed 6D voxel grid to compute the structure
and motion, while our subdivision surface representation enables us to optimize
the different “frequency bands” of the mesh separately [?], similar to multi-grid
methods in numerical analysis.

2 Preliminaries and Definitions 

The camera configuration (Figure 1) is parameterized by the camera projection
matrices $M_k$ that relate the world coordinate system to the image coordinate
system of camera $k$. The calibration is done using a calibration frame. In the
following we assume that the images have already been corrected for radial
distortion. We use the conventional pinhole camera model, where the surface point
$P = [x, y, z]^T$ in world coordinates projects to image point $p_k$ in camera $k$. The
image coordinates and brightness intensity of this point under the assumption
of Lambertian reflectance properties for the head are given by

\[ p_k = f \, \frac{M_k [P; 1]}{M_{k,3} [P; 1]} \tag{1} \]

\[ I(p_k; t) = -C_k \cdot \rho(P) \cdot [n(P; t) \cdot s(P; t)] \tag{2} \]

where $f$ is the focal length, $M_{k,3}$ is the third row of $M_k$, $[P; 1]$ is the homogeneous
representation of $P$, $\rho(P)$ is the albedo of point $P$, $C_k$ is the constant that
describes the brightness gain for each camera, $n$ is the normal to the surface
at $P$, and $s$ the direction of incoming light (see [?]). We assume that $\rho(P)$ is
constant over time ($d\rho/dt = 0$). In addition, since we record the video sequences
at 60 Hz, we have $d/dt\,[n(P; t) \cdot s(P; t)] = 0$. Thus the brightness of $p_k$ stays
constant over short time intervals.
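
The projection and brightness model of Eqs. (1) and (2) can be illustrated with a minimal Python sketch. The function names, the toy camera matrix and all numeric values below are assumptions chosen for this example only; they are not taken from the system described here.

```python
import numpy as np

def project(M_k, P, f=1.0):
    """Pinhole projection of world point P with a 3x4 matrix M_k (Eq. 1)."""
    P_h = np.append(P, 1.0)      # homogeneous representation [P; 1]
    q = M_k @ P_h
    return f * q / q[2]          # divide by (third row of M_k) applied to [P; 1]

def brightness(C_k, albedo, n, s):
    """Lambertian brightness of Eq. 2: -C_k * albedo * (n . s)."""
    return -C_k * albedo * float(np.dot(n, s))

# Hypothetical camera at the origin looking along the z-axis.
M = np.hstack([np.eye(3), np.zeros((3, 1))])
p = project(M, np.array([2.0, 4.0, 2.0]))
```

With these toy values the third component of `p` is 1 by construction, so the first two components are the image coordinates.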

There are many different ways to represent surfaces in space, for example
B-spline surfaces, deformable models, iso-surfaces or level sets, and polygonal
meshes (for an overview see [?]). Subdivision surfaces combine the ease of
manipulating polygonal meshes with the implicit smoothness properties of B-spline
surfaces. Starting from a control mesh C, we recursively subdivide the mesh
according to a set of rules until in the limit we generate a smooth limit surface
S. Examples of different subdivision schemes can be found in [?].

A Loop subdivision surface can be evaluated analytically [14] for every surface
point as a function of its control points. We can express the position of any
mesh surface point $s$ as a linear combination of the subdivision control points $c_i$;
if we denote the limit matrix by $L_i$, we can write $s = L_i c_i$. The number of control
points determines the degrees of freedom of the mesh. On each level $i$, we can
express the position of the control points as a linear combination (denoted by
the refinement matrices $C_i$, $D_i$) of the control points on a coarser level $c_{i-1}$ and
the detail coefficients $d_{i-1}$ that express the additional degrees of freedom of the
new vertices inserted in the refinement step from level $i-1$ to level $i$. Formally,
we can write $s = L_i c_i = L_i [C_{i-1} \; D_{i-1}] [c_{i-1}; d_{i-1}]$. The decomposition can now be
iterated (similar to [?]):

\[
\begin{aligned}
s = L_i c_i
&= L_i \begin{bmatrix} C_{i-1} & D_{i-1} \end{bmatrix}
   \begin{bmatrix} c_{i-1} \\ d_{i-1} \end{bmatrix} \\
&= L_i \begin{bmatrix} C_{i-1} & D_{i-1} \end{bmatrix}
   \begin{bmatrix} C_{i-2} & D_{i-2} & 0 \\ 0 & 0 & I_{i-1} \end{bmatrix}
   \begin{bmatrix} c_{i-2} \\ d_{i-2} \\ d_{i-1} \end{bmatrix} \\
&= L_i \begin{bmatrix} C_{i-1} & D_{i-1} \end{bmatrix}
   \begin{bmatrix} C_{i-2} & D_{i-2} & 0 \\ 0 & 0 & I_{i-1} \end{bmatrix}
   \cdots
   \begin{bmatrix}
     C_0 & D_0 & 0 & \cdots & 0 \\
     0 & 0 & I_1 & 0 & 0 \\
     0 & 0 & 0 & \ddots & 0 \\
     0 & 0 & 0 & 0 & I_{i-1}
   \end{bmatrix}
   [c_0, d_0, d_1, \dots, d_{i-1}]^T .
\end{aligned}
\tag{3}
\]



This forms the basis of our coarse-to-fine algorithm. Starting from a coarse
mesh with only a few control points, we find the mesh that optimizes our error
criterion. The mesh is then subdivided, thereby increasing the degrees of
freedom, which enables the algorithm to adapt the mesh resolution according to the
complexity of the surface geometry and motion field.
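
The hierarchical decomposition of the control points into a coarse mesh plus detail coefficients can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the refinement matrices are random stand-ins (real Loop subdivision matrices have a specific sparse structure), the level sizes are hypothetical, and the limit matrix is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 7, 13]  # hypothetical control-point counts on levels 0, 1, 2

# Hypothetical refinement matrices with c_{j+1} = C_j c_j + D_j d_j.
C = [rng.standard_normal((sizes[j + 1], sizes[j])) for j in range(2)]
D = [rng.standard_normal((sizes[j + 1], sizes[j + 1] - sizes[j])) for j in range(2)]

c0 = rng.standard_normal((sizes[0], 3))  # coarsest control points (x, y, z)
d = [rng.standard_normal((sizes[j + 1] - sizes[j], 3)) for j in range(2)]  # details

def refine(c0, d, C, D):
    """Apply the refinement level by level."""
    c = c0
    for C_j, D_j, d_j in zip(C, D, d):
        c = C_j @ c + D_j @ d_j
    return c

def stacked_operator(C, D):
    """Single matrix A with c_i = A [c_0; d_0; ...; d_{i-1}]."""
    A = np.hstack([C[0], D[0]])
    for C_j, D_j in zip(C[1:], D[1:]):
        # [C_j D_j] [[A, 0], [0, I]] equals [C_j A, D_j]
        A = np.hstack([C_j @ A, D_j])
    return A

c_fine = refine(c0, d, C, D)
c_stacked = stacked_operator(C, D) @ np.vstack([c0] + d)
```

Both routes give the same fine-level control points, which is what allows the coarse coefficients and each band of detail coefficients to be optimized separately.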



3 Multi-camera Stereo Estimation 

We incrementally construct the object silhouettes in each view by modeling the 
temporal dynamics of the changing foreground pixels and the static background 
pixels. The intersection of the silhouettes in world space defines the maximal 
extent of the object. Using information from silhouettes alone, it is not possible 
to compute more than the visual hull [?] of the object in view. Therefore, to
refine our 3D surface estimate of the object, we adapt the vertices of the mesh
in consecutive time frames to optimize the following error measure.

Following equation (2), we can assume that the brightness of the projection
of P is the same in all images up to a linear gain and offset, and that it is only
changing slowly over time. Each patch T on the object surface traces out a
volume in the spatio-temporal image space of each camera. The integration domain
is the 1-ring of triangles around each control point. Thus, after determining the
visibility of each mesh triangle in all the cameras by a z-buffer algorithm (V is
the set of all pairs of cameras that can both observe the mesh patch), we define
the following matching functional in space-time (similar to [?])



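\[ E(T, t) = \sum_{(i,j) \in V} \left[ \frac{\langle I_i(t), I_i(t+1) \rangle}{|I_i(t)| \cdot |I_i(t+1)|} + \frac{\langle I_i(t+1), I_j(t+1) \rangle}{|I_i(t+1)| \cdot |I_j(t+1)|} \right] \tag{4} \]

\[ \langle I_i(t), I_j(t) \rangle = \int_{P \in T} \left( I_i(p_i(P, t)) - \bar{I}_i(t) \right) \cdot \left( I_j(p_j(P, t)) - \bar{I}_j(t) \right) dP \tag{5} \]

\[ \bar{I}_i(t) = \int_{P \in T} I_i(p_i(P, t)) \, dP, \qquad |I_i(t)|^2 = \langle I_i(t), I_i(t) \rangle \tag{6} \]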

to compare corresponding spatio-temporal image volumes between pairs of
cameras. The correspondence hypotheses are based on the current 3D structure and
motion estimate given by the position of the mesh control points at time t and
t + 1. We combine the correlation scores from all the camera pairs by taking
a weighted average with the weights W(i, j) depending on the angles between
the optical axes of cameras i and j and the surface normal at point P.
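
The correlation score between two image patches can be sketched in discrete form as follows. This is a minimal Python illustration, assuming that each patch has already been sampled into a 1D array of brightness values; the sample mean stands in for the patch average, and the function names are hypothetical.

```python
import numpy as np

def inner(a, b):
    """Zero-mean inner product of two sampled patches (discrete stand-in)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum((a - a.mean()) * (b - b.mean())))

def ncc(a, b):
    """Normalized correlation <a, b> / (|a| * |b|) with |a|^2 = <a, a>."""
    return inner(a, b) / np.sqrt(inner(a, a) * inner(b, b))

patch = np.array([1.0, 2.0, 4.0, 3.0])   # hypothetical brightness samples
same_up_to_gain = ncc(patch, 2.0 * patch + 5.0)
inverted = ncc(patch, -patch)
```

The score is unchanged by a linear gain and offset between cameras, which matches the brightness assumption made above.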

The derivatives with respect to the control points can then be computed
easily from the derivative of the projection equation (Eq. (1)) and the spatio-temporal
image derivatives. We use the BFGS quasi-Newton method in the MATLAB
Optimization Toolbox to optimize over the control point positions. The upper bounds
for their displacement are given by the silhouette boundaries, which we include
as inequality constraints in the optimization. The multi-resolution framework
improves the convergence of the optimization, similar to multi-grid methods in
numerical analysis, since the coarser meshes initialize the refined mesh already
close to the true solution. This allows us to reduce the size of the spatio-temporal
image volumes we compare, thus increasing efficiency and accuracy.



4 Results 

We have established in our laboratory a multi-camera network consisting of
sixty-four cameras, Kodak ES-310, providing images at a rate of up to eighty-five
frames per second; the video is collected directly on disk. The cameras are
connected by a high-speed network consisting of sixteen dual-processor Pentium
450s with 1 GB of RAM each, which process the data.

Spatio-Temporal Analysis of Human Faces

Fig. 1. Calibrated Camera Setup with Example Input Views

For our experiments we used eleven cameras, nine gray-scale and two color, placed
in a dome-like arrangement around the head of a person who was opening his
mouth to express surprise (Figure 1) while turning his head and moving it
forward. The recovered spatio-temporal structure enables us to synthesize
texture-mapped views of the head from arbitrary viewing directions (Figures 2a-2c).
The textures, always taken from the least oblique camera with respect to a
given mesh triangle, were not blended together, in order to illustrate the good
agreement between adjacent texture region boundaries (note the agreement between the
grey-value structures in 2c despite absolute grey-value differences). This
demonstrates that the spatial structure of the head was recovered very well.

To separate the non-rigid 3D flow of the mouth gesture from the motion field
due to the turning of the head, we fit a rigid flow field to the 3D motion flow field
by parameterizing the 3D motion flow vectors by the instantaneous rigid motion
$dP/dt = v + \omega \times P$, where $v$ and $\omega$ are the instantaneous translation and
rotation ([?]). We use iteratively reweighted least squares to solve for the rigid
motion parameters, treating the non-rigid motion flow vectors as data outliers.
By subtracting the rigid motion flow from the full flow, we extract the non-rigid
flow. It can be seen that the motion energy (integrated magnitude of the flow)
is concentrated in the non-rigid regions of the face such as the mouth and the
jaw, as indicated by the higher brightness values in Figure 2d. In contrast, the
motion energy on the rigid part of the head (e.g., forehead, nose and ears) is
significantly smaller. In the close-up view of the mouth region (Figure 2f) we can
easily see how the mouth opens and the jaw moves down. The brightness values
in Figure 2d increase proportionally with the magnitude of the motion
energy. Although it is obviously hard to visualize dynamic movement by static
imagery, the vector field and motion energy plots (Figures 2d-2f) illustrate that
the dominant motion, the opening of the mouth, has been correctly estimated.
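
The rigid-flow fit described above can be sketched with iteratively reweighted least squares in Python. This is an illustrative sketch, not the authors' implementation: the weight function (inverse residual norm, an L1-type choice), the iteration count and the function names are assumptions.

```python
import numpy as np

def skew(p):
    """Cross-product matrix [p]_x, so that skew(p) @ w equals p x w."""
    x, y, z = p
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def fit_rigid_flow(P, F, iters=50, eps=1e-6):
    """Fit F_i ~ v + omega x P_i; points with large residuals get low weight.

    P, F: (N, 3) arrays of surface points and 3D flow vectors.
    Returns the translation v and rotation omega.
    """
    # omega x P = -[P]_x omega, so each point contributes rows [I, -[P]_x].
    A = np.vstack([np.hstack([np.eye(3), -skew(p)]) for p in P])
    b = F.reshape(-1)
    w = np.ones(len(P))
    for _ in range(iters):
        sw = np.sqrt(np.repeat(w, 3))
        x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
        r = np.linalg.norm((A @ x - b).reshape(-1, 3), axis=1)
        w = 1.0 / np.maximum(r, eps)   # assumed reweighting rule
    return x[:3], x[3:]
```

Subtracting `v + np.cross(omega, P)` from the measured flow then leaves the non-rigid component.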

5 Conclusion and Future Work 

To conclude, we presented a method that is able to recover an accurate 3D
spatio-temporal description of an object by combining the structure and motion
estimation in a unified framework. The technique can incorporate any number
of cameras, and the achievable depth and motion resolution depend only on the
available imaging hardware. In the future, we plan to explore other surface
representations and to study the connection between multi-scale mesh representation
and multi-scale structure of image sequences that observe them to increase the
robustness of the algorithm even further.

Fig. 2. Results of 3D Structure and Motion Flow Estimation. (a)-(c) Three Novel Views from the Spatio-Temporal Model; (d) Motion Energy of Non-Rigid 3D Flow; (e) Non-Rigid 3D Motion Flow; (f) Non-Rigid Flow Close Up of Mouth.

Abstract. We present a PDE-based method for increasing the S/N
ratio in noisy fluorescence image sequences where particle motion has to
be measured quantitatively. The method is based on a novel accurate
discretization of 3-D nonlinear anisotropic diffusion filtering, where the
third dimension is the time t in the image sequence, using well adapted 
diffusivities. We have applied this approach to fluorescence image 
sequences of in vitro motility assay experiments, where fluorescently 
labelled actin filaments move over a surface of immobilized myosin. 
The S/N ratio can be drastically improved resulting in closed object 
structures, which even allows segmentation of individual filaments in 
single images. In general this approach will be very valuable when 
quantitatively measuring motion in low light level fluorescence image 
sequences used in biomedical and biotechnological applications for 
studying cellular and subcellular processes and in in vitro single 
molecule assays. 

Keywords: Anisotropic Diffusion Filtering, Fluorescence Microscopy, 
Image Restoration, Image Sequence Processing, Noise. 



1 Introduction 

Fluorescence imaging techniques have emerged as central quantitative tools in
biomedical and biotechnological research. With their high spatial and temporal
resolution they are ideally suited to study the spatio-temporal characteristics of
cellular, subcellular and molecular processes in a wide variety of applications [?].
This includes the fundamental regulation of cell function by second messengers,
e.g. Ca$^{2+}$ ions, and its pathological alterations associated with many diseases.
Further common applications are metabolic pathway analysis for drug design
and single molecule studies for the basic understanding of the complex molecular
interactions involved in cellular processes.

However, fluorescence image sequences from dynamic molecular assays pose
very high demands on image processing, since the S/N ratio is generally low due
to the limited number of photons available for detection. Therefore the image
sequences obtained from these experiments require special sophisticated methods
for image enhancement and quantitative analysis.

Therefore we have now developed a 3D anisotropic diffusion filter method
as an extension of the common 2D anisotropic methods (see e.g. [?]) in
order to use the full 3D information of these spatio-temporal datasets for image
enhancement. In particular, the enhancement and restoration of object features in
individual images will be presented, which even allows the reliable segmentation
and analysis of individual moving objects.

2 Actin Filament Movement in the in vitro Motility Assay 

The in vitro motility assay originally devised by Kron and Spudich [?] is a
typical example where dynamic molecular behavior is studied with highly sensitive
fluorescence imaging (see figure 1). It is routinely used to study the molecular
origin of force production and its alterations by various medically relevant
modulators, e.g. volatile anesthetics. The experimental set-up consists of a flow
chamber, where isolated actin filaments, which are labelled with the fluorescence
indicator rhodamine-phalloidin, move in the presence of adenosine triphosphate
(ATP) over a surface of the immobilized motor protein myosin. The chamber
is mounted on an epi-fluorescence microscope and a highly sensitive intensified
CCD camera is used to record the fluorescence. Since the fluorescence originates
from single fluorescently labelled actin filaments with a diameter of 5 nm, which
is much less than the microscopic resolution, the S/N ratio is very low.

In previous work [?] we showed that the structure tensor method [2] can
be successfully applied to the quantitative analysis of particle motion in low
S/N ratio actin filament fluorescence images, where classical particle tracking
approaches fail to produce reliable results due to massive segmentation errors.
As a pixel-based method the structure tensor is ideally suited to determine the
velocity distributions of the moving actin filaments in this experiment. However,
if individual particle properties, e.g. path length or filament flexibility, are of
interest, further analysis has to be carried out.

In the following we will therefore present a new approach for analyzing actin 
filament motion in these fluorescence image sequences based on a 3D anisotropic 
diffusion filter method. 

3 Anisotropic Diffusion 

Every dimension of an image sequence will be treated as a spatial dimension, 
thus a 2D image sequence is treated as a 3D data set. The original image will 
be changed by applying anisotropic diffusion, which is expressed by a diffusion 
time t. Anisotropic diffusion with a diffusion tensor evolves the initial image u 
under an evolution equation of type 



\[ \frac{\partial u}{\partial t} = \nabla \cdot (D \nabla u) \tag{1} \]

Fig. 1. Top row: 3 images from a time series of fluorescence images, where actin filament
motion is visible as the displacement of rod-like structures. Bottom row: A standard
threshold segmentation of an area of interest from the first image shows that particle
segmentation is difficult in these low S/N ratio image sequences due to the high amount
of noise inherent to single molecule fluorescence images. Data from [2].



with the evolving image $u(x, t)$, diffusion time $t$, 3D derivation vector $\nabla =
(\partial_{x_1}, \partial_{x_2}, \partial_{x_3})^T$ and the diffusion tensor $D$, a positive definite, symmetric $3 \times 3$
matrix. It is adapted to the local image structure measured by the structure
tensor [?]

\[ J_\rho(\nabla u_\xi) = G_\rho * (\nabla u_\xi \nabla u_\xi^T) \tag{2} \]

with convolution $*$, a Gaussian $G_\rho$ with standard deviation $\rho$, and $u_\xi := G_\xi * u$, a
regularized version of $u$. The normalized eigenvectors $e_i$ of $J_\rho$ give the preferred
local orientations, its eigenvalues $\mu_i$ give the local contrast along these directions.
The structure tensor is highly robust under isotropic additive Gaussian noise [3].
Using a diagonal matrix $M$ with $M_{ii} = \mu_i$, $J_\rho$ can be written

\[ J_\rho = (e_1, e_2, e_3) \, M \, (e_1, e_2, e_3)^T . \]

The diffusion tensor $D$ uses the same eigenvectors $e_i$. With the directional
diffusivities $\lambda_1, \lambda_2, \lambda_3$ and a diagonal matrix $\Lambda$ with $\Lambda_{ii} = \lambda_i$ it becomes

\[ D = (e_1, e_2, e_3) \, \Lambda \, (e_1, e_2, e_3)^T . \tag{3} \]

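The directional diffusivities $\lambda_i$ determine the behavior of the diffusion. They
shall be high for low values of $\mu_i$ and vice versa. In our application we use

\[ \lambda_i := \begin{cases} 1 & \text{if } \mu_i \le \sigma, \\ 1 - (1 - c) \exp\left( \dfrac{-d}{\mu_i - \sigma} \right) & \text{else,} \end{cases} \tag{4} \]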

where $c \in \,]0, 1]$, $d > 0$ and $\sigma > 0$ corresponds to the global absolute noise
level. The condition number of $D$ is bounded by $1/c$. Instead of this choice other
functions can be used. In comparative tests (see sec. 5.2) we also use nonlinear
isotropic diffusion with Tukey’s biweight in 2D and 3D (see e.g. [?]) and
edge-enhancing diffusion in 2D ([?]). So far the continuous equations for anisotropic
diffusion have been described. We now proceed by describing their discretization.
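
The construction of the diffusion tensor from the structure tensor can be sketched for a single voxel in Python. This is a minimal sketch under assumptions: the function names and default parameter values are hypothetical, and the exponent of the diffusivity follows Eq. (4) as it can be read from this text, so its exact form should be treated as an assumption.

```python
import numpy as np

def diffusivity(mu, c=0.1, d=1.0, sigma=1.0):
    """Directional diffusivity: 1 below the noise level, decaying towards c above."""
    mu = np.asarray(mu, dtype=float)
    lam = np.ones_like(mu)
    above = mu > sigma
    lam[above] = 1.0 - (1.0 - c) * np.exp(-d / (mu[above] - sigma))
    return lam

def diffusion_tensor(J, **params):
    """Build D with the eigenvectors of J and eigenvalues lambda_i(mu_i)."""
    mu, E = np.linalg.eigh(J)          # J = E diag(mu) E^T, columns of E are e_i
    lam = diffusivity(mu, **params)
    return (E * lam) @ E.T             # D = E diag(lam) E^T

# Hypothetical structure tensor with contrasts 0.5, 2 and 5 along the axes.
J_example = np.diag([0.5, 2.0, 5.0])
D_example = diffusion_tensor(J_example, c=0.1, d=1.0, sigma=1.0)
```

Because all diffusivities lie between c and 1, the resulting tensor stays symmetric positive definite with condition number at most 1/c.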



4 Accurate Discretization Scheme Using Optimized 
Filters 



Eq. (1) can be solved using an Euler forward approximation for $\partial u / \partial t$:

\[ \frac{u_i^{l+1} - u_i^l}{\tau} = \nabla \cdot (D_i^l \nabla u_i^l) \;\Leftrightarrow\; u_i^{l+1} = u_i^l + \tau \, (\partial_{x_1}, \partial_{x_2}, \partial_{x_3}) \, D_i^l \, (\partial_{x_1}, \partial_{x_2}, \partial_{x_3})^T u_i^l \tag{5} \]

where $\tau$ is the time step size and $u_i^l$ denotes the approximation of $u(x, t)$ in the
voxel $i$ at (diffusion) time $l\tau$. We use optimized separable first order derivative
filters to discretize $\nabla \cdot (D_i^l \nabla u_i^l)$ (see [?] for the 2D case). They are composed of a
common 1D derivative stencil (e.g. $[-1, 0, 1]/2$, denoted $\mathcal{D}$) and 1D smoothing
kernels (e.g. $[3, 10, 3]/16$, denoted $\mathcal{B}$) in all other directions

\[ \partial_{x_i} = \mathcal{D}_{x_i} * \mathcal{B}_{x_j} * \mathcal{B}_{x_k}, \tag{6} \]

where $\{i, j, k\}$ is a permutation of $\{1, 2, 3\}$, lower indices give the
direction of the kernel and $*$ is a convolution. These filters have been derived
in [?]. They approximate rotation invariance significantly better than related
popular stencils. The following algorithm is repeated for every time step $\tau$:

1. Calculate the structure tensor $J$ (eq. (2)).

2. Get the diffusion tensor $D$ by $J$ (eqs. (3) and (4)).

3. Calculate the flux $j_i := \sum_m D_{i,m} \, \partial_{x_m} u$, $\forall i \in \{1, \dots, n\}$.

4. Calculate $\nabla \cdot (D \nabla u)$ (eq. (1)) by $\nabla \cdot (D \nabla u) = \sum_m \partial_{x_m} j_m$.

5. Update in an explicit way (eq. (5)).

The iteration number is $N$ and the total diffusion time $T$ is $T = \tau N$.
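
Steps 3 to 5 of the algorithm above can be sketched in Python with NumPy only. The sketch uses the example stencils quoted in the text, [-1, 0, 1]/2 and [3, 10, 3]/16; the boundary handling (edge replication), the array layout and the function names are assumptions of this example. The stencils are applied as correlations, so that a rising ramp gives a positive derivative.

```python
import numpy as np

DERIV = np.array([-1.0, 0.0, 1.0]) / 2.0     # 1D derivative stencil
SMOOTH = np.array([3.0, 10.0, 3.0]) / 16.0   # 1D smoothing kernel

def filter_axis(u, kernel, axis):
    """Apply a 3-tap kernel along one axis, replicating edge values."""
    pad = [(1, 1) if a == axis else (0, 0) for a in range(u.ndim)]
    up = np.pad(u, pad, mode="edge")
    n = u.shape[axis]
    out = np.zeros(u.shape, dtype=float)
    for k, w in enumerate(kernel):
        out += w * np.take(up, np.arange(k, k + n), axis=axis)
    return out

def derivative(u, axis):
    """Derivative along `axis`, smoothed along all other axes."""
    out = filter_axis(u, DERIV, axis)
    for other in range(u.ndim):
        if other != axis:
            out = filter_axis(out, SMOOTH, other)
    return out

def explicit_step(u, D, tau):
    """One Euler forward step; u has shape (X, Y, Z), D has shape (X, Y, Z, 3, 3)."""
    grad = np.stack([derivative(u, a) for a in range(3)], axis=-1)
    flux = np.einsum("...ij,...j->...i", D, grad)             # step 3
    div = sum(derivative(flux[..., a], a) for a in range(3))  # step 4
    return u + tau * div                                      # step 5
```

Repeating `explicit_step` N times, with the diffusion tensor field recomputed from the current image in each iteration, gives the total diffusion time T = τN.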



5 Results 

5.1 Validation with Synthetic Test Sequences

For the evaluation of image processing methods used for analyzing motility assay
data we have created synthetic test data, as described in [?], in order to
characterize the algorithms under known conditions. They consist of rod-like objects,
which move in x-, y- and diagonal direction or in approximate circles, a basis
of all possible displacements. Various amounts of Gaussian noise can be added.
As shown in figure 2, the displacements of the objects are 1 pixel/frame for the
motion in x,y-direction and for the circular motion and $\sqrt{2}$ pixel/frame for the

[Figure 2: velocity histograms, v in pixel/frame, axis range 0.0 to 2.0. Fit values in the order printed: μ = 1.001 p/f, σ = 0.006 p/f; μ = 1.416 p/f, σ = 0.004 p/f; μ = 1.062 p/f, σ = 0.345 p/f; μ = 1.013 p/f, σ = 0.081 p/f; μ = 1.399 p/f, σ = 0.075 p/f.]

Fig. 2. Synthetic test data used to quantify the accuracy of the velocity reconstruction. 
The rod-like objects (grey values 200) move in x,y direction, in a 45 degree angle and in 
approximate circles, respectively, versus a black background. The test pattern with no 
noise, shown in the left panel, has Gaussian noise with a standard deviation of 60 grey 
levels added (middle panel). Applying the 3D anisotropic diffusion filtering resulted in 
the panel shown on the right. The respective velocity histograms as obtained with the 
structure tensor method, are given below each panel (velocity in pixel/frame or p/f). 



diagonal direction. The addition of Gaussian noise with a standard deviation of
60 grey levels leads to the loss of resolution of both peaks and to a broadening of
the velocity distribution as described in [?]. Applying the 3D anisotropic
diffusion filtering results in a significant improvement of image quality and moreover
successfully restores both velocity populations.

Even more important, the spatio-temporal filtering does not introduce
significant errors in object velocities, as can be seen from the peaks of Gaussian fits
applied to the histograms in figure 2. The error is below 5% for both velocity
populations, which is remarkable since the original noisy data cannot even be
reliably analyzed, as shown in the middle panel. Thus the 3D anisotropic
diffusion algorithm presented here has both a very high accuracy and restoration
power.

5.2 Denoising of Actin Filament Fluorescence Sequences 

We applied the 3D anisotropic diffusion filtering to noisy image sequences of
actin filament movement in the in vitro motility assay and additionally compared
the results to commonly used 2D diffusion filtering schemes. From figure 3 it
is evident that the successful restoration of actin filaments with closed object
structures can only be achieved when using the full spatio-temporal
information in the 3D anisotropic diffusion scheme. Even the 2D anisotropic diffusion
scheme cannot compensate for the heavy degradation of object structures
without introducing morphological errors. Therefore the extension of the existing 2D
anisotropic diffusion schemes to a full 3D scheme proves to be a major
advancement in analyzing noisy image sequences.

Fig. 3. Example of various diffusion filtering methods for the reconstruction of actin
filaments in in vitro motility assay image sequences. A In the non-processed original
data the high amount of noise can be seen, leading to non-closed object structures for
the filaments. B Edge-enhancing diffusion in 2D leads to closed object structures but
introduces significant errors at the leading and rear edge of the filaments. Nonlinear
isotropic diffusion with Tukey’s biweight in C 2D and D 3D cannot reconstruct
closed object structures. E The 3D anisotropic diffusion scheme produces closed object
structures without morphological changes in filament shape. Original fluorescence data
from [2].

6 Summary and Conclusions 

In summary we have presented a new approach to enhance noisy image
sequences, especially in low light level fluorescence microscopy. The additional use of
the temporal information in the anisotropic diffusion filtering enables for the
first time a reliable restoration of particle properties, normally not possible with
common 2D schemes. Additionally we have validated the method on
computer-generated test sequences, showing both the very high object restoration power
and the very high accuracy in restoring the actual velocities of moving objects.
We think that this approach is valuable in many other applications where low
light level dynamic processes have to be quantitatively analyzed by image
sequence processing.

Abstract. This paper focuses on the problem of ill-posedness of
deformable point set registration where the point correspondences are not
known a priori (in our case). The basic elements of the investigated kind
of registration algorithm are a cost functional, an optimization strategy
and a motion model which determines the kind of motions and
deformations that are allowed and how they are restricted. We propose a method
to specify a shape-adapted deformation model based on thin-plate splines
and point clustering and compare it with the annealing of the regularization
parameter and with a regular scheme for the warping of space with
thin-plate splines. As the criterion for the quality of the match we consider the
preservation of physically/anatomically corresponding points.

Our natural deformation model is determined by placing the control
points of the splines in a way adapted to the superimposed point sets
during registration using a coarse-to-fine scheme. Our experiments with
known ground truth show the impact of the chosen deformation model
and that the shape-oriented model consistently recovers corresponding
points very accurately. We observed a stable improvement of this accuracy
for an increasing number of control points.



1 Introduction 

Registration of spatial data is a known problem in computer vision and is of great
importance for many applications [?]. For instance, in medical image analysis
one may want to compare 3-D patient data from a CT scan with an anatomical
atlas in order to detect anomalies. Further applications are model-based image
segmentation and tracking of deformable objects over time [?].

In order to register (or “match”) two images (or point sets), a parameterized
transformation is applied to one of the images (the sensed image [12]). The
parameters are then adjusted in such a way that the sensed image most accurately
resembles the second image (the reference image). A classical way of registration
comprises three basic elements: a cost functional, an optimization strategy and,
after rigid and affine registration, the specification of the allowed deformations
and their restrictions. We will denote this third element as the motion model.
The motion model is necessary to overcome the ill-posedness of the
optimization problem, i.e. to restrict the solution space for an optimal fit. Physically or
anatomically corresponding points are found implicitly. Approaches extracting
critical points in both images explicitly and identifying them thereafter [?]
are very noise sensitive (therefore, [?] proposes a semi-automatic algorithm) and
problematic especially for non-rigid registration. Additional image information
is lost when defining the transformation just by interpolating or approximating
between the extracted corresponding points.

For the registration of raw grey level images, progress has been achieved in
defining a well-suited cost functional and appropriate strategies for optimizing it.
Various so-called iconic similarity measures have been proposed [?]. Slowly
converging non-gradient-based optimization strategies such as downhill simplex or
simulated annealing [?] have to be used. Non-rigid deformations are generally
modeled by volume splines defined using a regular grid of control points [?].
On the other hand, methods operating on already extracted and reconstructed
surfaces usually define the deformations only on the surfaces themselves [?].
Approaches using physically based elasticity models have been proposed in this
context [?]. In this paper we focus on the non-rigid registration of point sets.
To use point-set-based methods for grey level image registration, the application
of low-level operators (edge detection, confinement tree extraction [?], etc.) is
sufficient. They do not require surface reconstruction. Several cost functionals
have been proposed [?]. The three algorithms we present here minimize, in
accordance with [12], the sum of the squared least distances of each sensed data
point from the reference point set. They differ only in the chosen motion model.
Volume deformations are specified by interpolating (as in [?]) between a number
of displacement vectors located at certain spots in space. The end points of these
vectors (the displacements) are the free parameters to be adjusted, whereas the
start points, together with the amount of regularization (e.g., penalization of
bending or strain [2]) and a coarse-to-fine scheme, determine the motion model,
i.e. the set of possible deformations of which the best fitting one is chosen. The
idea is to place the start points in the best suited way and to interpolate with
thin-plate splines in order to be able to place them at any position in space.

We propose a new approach based on this idea. It is adapted to the shapes
formed by the point sets. Using a point clustering approach, we place the start
points where the discrepancy between the two surfaces is greatest. The two other
methods are inspired by previous work. We choose the displacement start points
on a rectangular multi-resolution grid similar to that used in [12] (where,
however, the non-rigid volume deformations are represented by trilinear B-splines)
and either increase the number of control points level by level or use a
fixed number of control points and produce a coarse-to-fine behavior similar
to that in [?] by the annealing of the regularization parameter. We compare the
three different methods of choosing the motion model in terms of the achievable
matching quality. Our criterion for the quality of the match is how well
physically/anatomically corresponding points of the two surfaces are brought into
registration for a given number of transformation parameters. As will be shown,
the motion model can have a strong influence on the matching quality. We do
not focus on statistical robustness (e.g., treatment of outlier points) in this work.

2 Problem Formulation and Techniques

For our purpose we state the non-rigid matching problem as follows. Given a
reference set $S \subset \mathbb{R}^3$ fixed in space, a set of $N$ (sensed) sample points $\{r_i \in
\mathbb{R}^3 \mid i = 1 \dots N\}$, and a deformation $T : \mathbb{R}^3 \times \Pi \to \mathbb{R}^3$ depending on a set
of parameters $P \in \Pi$, with $\Pi$ being the parameter space, find the set of
parameters $P^*$ such that $C(P^*) = \min_{P \in \Pi} C(P)$, where

\[ C(P) = \sum_{i=1}^{N} \left( d(T(P, r_i), S) \right)^2 \tag{1} \]

is the so-called cost or error functional. Here, $d(\cdot, S)$ is the Euclidean distance
from the surface $S$:

\[ d(x, S) = \min_{s \in S} |x - s| . \]

The cost functional penalizes the sum of the squared distances of the transformed points $T(P, r_i)$ from $S$.
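
The cost functional can be written down directly for a reference set given as a finite point set. The following Python sketch is an illustration only: the brute-force distance computation, the two toy deformations and all sample coordinates are assumptions of this example.

```python
import numpy as np

def distance_to_set(x, S):
    """d(x, S) = min over s in S of |x - s|, for every row of x."""
    diff = x[:, None, :] - S[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def cost(T, P, r, S):
    """Sum of squared distances of the transformed points T(P, r_i) from S."""
    return float(np.sum(distance_to_set(T(P, r), S) ** 2))

def translate(P, r):
    """Hypothetical toy deformation: shift all points by P."""
    return r + P

def collapse(P, r):
    """Degenerate map sending every sensed point to the single point P."""
    return np.tile(P, (len(r), 1))

S = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # reference point set
r = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])   # sensed points

cost_identity = cost(translate, np.zeros(3), r, S)
cost_collapsed = cost(collapse, S[0], r, S)
```

The collapsed map reaches zero cost although it is clearly not a meaningful registration, which is why the cost functional alone does not define the problem well.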

To be accurate, the problem is still ill-defined. In addition to the requirement
that C is to be minimized, we want the deformation to be “reasonable”. For
example, according to the definition above, the problem would be considered
solved if all sensed points are mapped to one single point s on S, which is surely
not the intended solution. The additional condition on the transformation is that
“logically” corresponding points of the two shapes should be mapped to each
other. For this, we specify a motion model in section 2.1, including regularization
(an additional term in the cost function, which penalizes strong bending; see
equation (2)) and stepwise matching from coarser to finer levels. As is known,
these strategies help to satisfy the correct-mapping criterion.

For representing a non-rigid volumetric deformation $T : \mathbb{R}^3 \to \mathbb{R}^3$, $x \mapsto T(x)$,
we use thin-plate splines. These are determined by a number of pairs of control
points which are mapped to each other (or, equivalently, a set of displacement
vectors, located at certain positions in space). We will call the control points in
the untransformed volume (resp. the start points of the displacement vectors)
A-landmarks, or $p_i$, and their homologs in the transformed volume (resp. the
vector end points) B-landmarks, or $q_i$. In the space between the control points
the deformation is interpolated by thin-plate splines. Descriptively speaking,
one may think of the space as a rubber cube that can be pulled in different
directions at certain working points. A precise mathematical description of
thin-plate splines in this context is given in Bookstein, 1989 [?]. For registration, the
A-landmarks are fixed using the schemes described in the next section. The
positions of the B-landmarks are the free parameters of the deformation: $P =
(q_1^x, q_1^y, q_1^z, \dots, q_n^x, q_n^y, q_n^z)$ with $n$ the total number of control point pairs.

Thin-plate splines are minimally bent while satisfying the interpolation
constraints given by the control points ($T(p_i) = q_i$); that is, they minimize a
bending energy $U_{\mathrm{bend}}$.
The algebra of thin-plate splines allows this quantity to be computed easily from
the positions of the control points. We add this value to the cost function as a
regularization term, to prevent unintended effects arising from excessive bending:

$$C_{\mathrm{total}} = C + \alpha\, U_{\mathrm{bend}} \qquad (2)$$

$\alpha$ is a weighting factor to be chosen. (Please note that not the control
points but all data points enter in $C$; therefore equation (2) must not be
confused with work on approximating thin-plate splines.) The minimization is done
by the Levenberg-Marquardt algorithm, a standard derivative-based non-linear
minimization method (cf. Press et al.). The necessary computation of
$\partial C_{\mathrm{total}} / \partial P$ is straightforward due to the simple
algebra of thin-plate splines. For a fast estimation of $d(x, S)$ and its
derivative with respect to $x$ (also necessary to compute
$\partial C_{\mathrm{total}} / \partial P$) we use a precomputed distance map
(“octree spline”) as proposed in the literature. Here, the values of $d(x, S)$ are
computed only for the corners of a multi-resolution grid around the reference
set, and stored. Then, $d(x, S)$ can be estimated for an arbitrary point by
finding the eight nearest corner values and interpolating between them. Since
the interpolating function is differentiable in between the corners, the
derivative can also be assessed.

2.1 Choosing a Motion Model by Placing the A-Landmarks 

While the B-landmarks serve as transformation parameters, being adjusted by 
the Levenberg-Marquardt algorithm, the positions of the A-landmarks stay con- 
stant during the matching process and have to be fixed initially. Their placing — 
together with the amount of regularization — determines the motion model. It 
turns out that the choice of the positions of the A-landmarks can have a strong 
impact on the matching quality. In this work we investigate the effects of the 
chosen motion model, and we evaluate three different approaches, all of which 
incorporate the coarse-to-fine idea. Approaches A and B are inspired by previous 
work, method C is the main contribution of this paper. 

A: Octree Method (level-by-level). For choosing the positions of the
A-landmarks, we employ an octree representation (see Figure 1(a)) of the
untransformed sensed surface. The octree is built as follows. Starting with an
initial cube containing the shape of interest, each cube is recursively split into 8
sub-cubes unless a cube doesn’t contain parts of the shape, or a pre-set minimum
cube size is reached. With this process a hierarchical multi-resolution
grid is formed, which has a finer resolution close to the shape. We place one
A-landmark in the center of each octree cube. In the coarsest step we use a 2-level
octree, resulting in 9 A-landmarks (one from the initial cube plus 8 from the
8 sub-cubes of the second octree level). The transformation is initialized with
the identity mapping ($q_i = p_i$). After running Levenberg-Marquardt, yielding
the transformation $T^{(1)}$, we add more A-landmarks, corresponding to the next
finer octree level. The homologous new B-landmarks are initialized with
$q_i = T^{(1)}(p_i)$, and another run of Levenberg-Marquardt results in the
transformation $T^{(2)}$. This procedure is repeated until the finest desired
level is reached.
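The landmark placement of method A can be sketched as follows. This is only an illustration under simplifying assumptions: the surface is given as a point sample, the stopping rule is a maximum level instead of a minimum cube size, and the function name `octree_centers` is hypothetical.

```python
import numpy as np

def octree_centers(points, lo, size, max_level, level=1):
    """Return (level, center) for every octree cube that is created.

    A cube is split into 8 sub-cubes only if it contains surface points
    and the maximum level has not been reached yet.
    """
    centers = [(level, lo + size / 2.0)]
    inside = np.all((points >= lo) & (points < lo + size), axis=1)
    if level == max_level or not inside.any():
        return centers
    half = size / 2.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                sub_lo = lo + half * np.array([dx, dy, dz])
                centers += octree_centers(points[inside], sub_lo, half,
                                          max_level, level + 1)
    return centers

# Hypothetical toy surface consisting of a single sample point.
surface = np.array([[0.1, 0.1, 0.1]])
two_level = octree_centers(surface, np.zeros(3), 1.0, max_level=2)
three_level = octree_centers(surface, np.zeros(3), 1.0, max_level=3)
```

With two levels the sketch yields the 9 A-landmarks mentioned above; at deeper levels only cubes that contain surface points are refined, so the landmarks concentrate near the shape.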
Fig. 1. (a) 5-level octree associated with the shown 3-D surface (orthogonal
projection) and corresponding A-landmarks (levels 1-4 only), different levels
drawn in different colors. (b), (c) Matching of two cell nuclei. Filled: reference
surface; wire-frame: sensed surface. A grid is superimposed to show the occurring
volume deformation: (b) after affine registration, (c) after thin-plate spline
registration with method A, level 4. The small balls are the B-landmarks.



B: “Annealing” Method (octree-based). The coarse-to-fine strategy, necessary
to prevent early local deformation which can result in wrong registrations,
is incorporated not by increasing the number of control points but by letting the
parameter $\alpha$ in equation (2) decrease stepwise from “infinity”; after each
change of $\alpha$ the minimization algorithm is run once. The landmarks are
inserted at once for all desired levels of the same octree as in A. The decrease
schedule was exponential with five steps, where the minimum value of $\alpha$ was
the same as in method A. Increasing the number of steps didn’t improve matching
results (see section 3).

C: Distance Cluster Method. The idea behind this method is to place the
A-landmarks where the displacement between the two shapes is strongest. As in
method A we follow a coarse-to-fine scheme by incrementally adding more control
points. In each step, the positions of the new A-landmarks are determined as
follows: We apply a K-means clustering to the sensed data points, where the
mean of each cluster is computed as a weighted center of mass with the “mass”
of each data point being dependent on the distance of $T^{(k)}(r_i)$ from the
reference surface:

$$CM_j = \frac{\sum_{i \in A_j} r_i \, d(T^{(k)}(r_i), S)^3}{\sum_{i \in A_j} d(T^{(k)}(r_i), S)^3}$$

where $A_j$ is the set of the indices of data points associated with the $j$-th
cluster center. Taking the 3rd power of $d$ is a heuristic proposal. After
convergence of the K-means algorithm the $CM_j$ will lie preferably in regions of
large distances $d(T^{(k)}(r_i), S)$. The A-landmarks are inserted at the $CM_j$.
As in method A, $T^{(k)}$ is the identity transformation at the start, and in
further steps the current thin-plate spline deformation. Again, the corresponding
B-landmarks are initialized according to that transformation.







In the first step, 8 additional landmarks are set on the corners of a bounding
box containing the sensed surface, and only one landmark at the CM of the whole
shape (K-means with k = 1). How many new landmarks to add in each of the
further coarse-to-fine steps remains to be investigated. In our experiments
it turned out to be advantageous to add 1 + 3n new A-landmarks, where n is the
number of steps already taken. In the last step, this rule was trimmed in order
to reach the desired number of control points.
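The clustering step of method C can be illustrated with a small weighted K-means sketch. It assumes that the distances $d(T(r_i), S)$ have already been evaluated and are passed in as an array; the function name `weighted_kmeans`, the random initialization and the fixed iteration count are assumptions made for this example.

```python
import numpy as np

def weighted_kmeans(points, dist, k, n_iter=50, seed=0):
    """K-means whose cluster means are centers of mass with weights d^3."""
    rng = np.random.default_rng(seed)
    w = dist ** 3
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every data point to its nearest cluster center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            m = labels == j
            if m.any() and w[m].sum() > 0:
                # Weighted center of mass of cluster j.
                centers[j] = (w[m, None] * points[m]).sum(axis=0) / w[m].sum()
    return centers

# Hypothetical toy data: two points with residual distances 1 and 2.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
d = np.array([1.0, 2.0])
cm = weighted_kmeans(pts, d, k=1)
```

With k = 1 the single center is pulled towards the point with the larger residual distance (weights 1 and 8), which is the intended effect of the cubic weighting.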

3 Experimental Results 

We took the surface of a cell nucleus, reconstructed from confocal laser
microscopy image data, and deformed it with 9 different random thin-plate-spline
deformations. These deformations consisted of 3 successive steps: one coarse
deformation (7 random control points, displaced randomly by a relatively large
distance), one intermediate step (20 points with moderate displacements), and
a fine deformation (40 points, displaced by a relatively small distance). We then
matched the deformed cells to the original reference point set, using the 3
different motion models on 2 different levels of fineness (all in all 54 experiments).

To evaluate the matching quality we took the distances of all deformed sensed
points from their corresponding reference points (difference from ground truth):

$$\varepsilon = \sum_{i=1}^{N} |T(r_i) - m_i|$$

where $r_i$ are the sensed data points, $m_i$ are the reference data points, and
$N$ is the number of data points. In contrast, the cost function (1) measures the
distances of the sensed data points from the nearest reference point.

Figure 2 shows the values of $\varepsilon$ for all the experiments. We can see
that the cluster method (C) gives the best quality (low $\varepsilon$) on both
levels of fineness. This is also true for an even coarser level of 3, resp. 9
landmarks (not shown). The “annealing” method (B) produces higher error values
when we get to the finer level. This instability is probably the consequence of
early local registrations due to the initially high number of control points.
Further refinement of the “annealing rate” didn’t improve the results. Methods A
and C, in contrast, produce monotonically better matches with more degrees of
freedom.

Another thing to note is that the values of $\varepsilon$ for the cluster method
(C) on the coarse level are comparable to (sometimes better than) those of
method A on the fine level. This suggests that the cluster method is well suited
to establish an appropriate motion model using only relatively few adjustable
parameters (69 vs. 271 control points). This drastically decreases computation
time, since its dependence on the number of control points is approximately
quadratic. A low number of transformation parameters is also helpful for the
concise description of the deformation and for the comprehension of the
mechanisms behind it.
Fig. 2. 9 matches have been executed for each of the three A-landmark placing
methods and on two degrees of fineness. The diagram in the upper left corner
displays the values of $\varepsilon$ (our measure for the matching quality, lower
values are better) for all experiments. The symbols representing method C have
been connected by lines for emphasis. The second column shows only the results
for the coarse level, corresponding to an octree level of four (64 control points
(CP) on average). For the cluster method we also took 64 CP in total. Shown are
the values of $\varepsilon$, the occurring bending energy $U_{\mathrm{bend}}$,
and the cost $C$. The third column shows the same data for the fine level,
corresponding to an octree level of five, resp. 271 CP on average. For method C
with 271 CP, the values of $C$ are far below those for methods A and B. In order
to have a better basis, when comparing $\varepsilon$, to assess the ability of the
methods to find correct associations, we decreased the number to 125, resulting
in a value of $C$ similar to methods A and B. The result is shown in the lower
left corner.



4 Conclusion 

We showed that in deformable registration choosing an appropriate motion
model can strongly improve the matching quality and can allow a large
decrease in the number of parameters to be adjusted and hence in computation
time. The presented method was able to save 75% of the free transformation
parameters in one experiment while preserving matching quality (method C,
coarse, 1:25 min average execution time, vs. method A, fine, 64 min; see Figure
2). This also permits a very concise description of the deformation.
Chui and Rangarajan proposed a thin-plate spline based point matching
algorithm using deterministic annealing as a minimization method. In their
approach the coarse-to-fine idea enters by slowly decreasing the regularization
parameter, which depends on the annealing temperature. Our results give strong
evidence that doing coarse-to-fine matching solely through regularization is not
optimal and especially sensitive to the number of degrees of freedom.

Further work will include the comparison between the use of thin-plate splines
(TPS) and of trilinear B-splines to interpolate the volume deformations in the
motion model. To this end, we have to produce ground truth (i.e., synthetic image
pairs) using non-TPS transformations for establishing a more objective
validation. We will also test the algorithm with natural image pairs. Due to the
lack of a ground truth we will evaluate matching quality by comparing manually
selected corresponding points.
Abstract. This paper presents an optimal estimate for the projection
matrix for points of a camera from an arbitrary mixture of six or more 
observed points and straight lines in object space. It gives expressions 
for determining the corresponding projection matrix for straight lines to- 
gether with its covariance matrix. Examples on synthetic and real images 
demonstrate the feasibility of the approach. 



1 Introduction 

Determining the orientation of a camera is a basic task in computer vision and
photogrammetry. Direct as well as iterative solutions for calibrated and
uncalibrated cameras are well known in case only points are used as basic
entities. Hartley and Zisserman (cf. pp. 166-169) indicate how to use point and
line observations simultaneously for estimating the projection matrix for points.
However, they give no optimal solution to this problem.

We present a procedure for optimal estimation of the projection matrix and 
for determining the covariance matrix of its entries. This can be used to derive 
the covariance matrix of the projection matrix for lines, which allows to infer 
the uncertainty of projecting lines and planes from observed points and lines. 

The paper is organized as follows: Section 2 presents some basic tools from 
algebraic projective geometry, gives expressions for the projection of 3D points 
and lines from object space into the image plane of a camera and explains the 
determination of the matrix Q for line projection from the matrix P for point 
projection. Section 3 describes the procedure to estimate P and to derive the 
covariance matrices of the orientation parameters. Finally we give examples to 
demonstrate the feasibility of the approach. 

2 Basics 

We first give the necessary tools from algebraic projective geometry easing the 
computation of statistically uncertain geometric entities. 



B. Radig and S. Florczyk (Eds.): DAGM 2001, LNCS 2191, pp. 84-|21| 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 



Optimal Camera Orientation from Points and Straight Lines 






2.1 Points, Lines, and Planes and Their Images 

Points $x^T = (x_0^T, x_h)$ and lines $l^T = (l_h^T, l_0)$ in the plane are
represented with 3-vectors, split into their homogeneous part, indexed with $h$,
and their non-homogeneous part, indexed with $0$. Points $X^T = (X_0^T, X_h)$ and
planes $A^T = (A_h^T, A_0)$ in 3D space are represented similarly. Lines $L$ in 3D
are represented with their Plücker coordinates $L^T = (L_h^T, L_0^T)$. A line can
be derived from two points $X$ and $Y$ by the direction
$L_h = Y_h X_0 - X_h Y_0$ of the line and the normal $L_0 = X_0 \times Y_0$ of the
plane through the line and the origin. We will need the dual 3D line
$\bar{L}^T = (L_0^T, L_h^T)$, which has homogeneous and non-homogeneous part
interchanged. The line parameters have to fulfill the Plücker constraint
$L_1 L_4 + L_2 L_5 + L_3 L_6 = L_h^T L_0 = \frac{1}{2} \bar{L}^T L = 0$, which is
clear, as $L_h = Y_h X_0 - X_h Y_0$ is orthogonal to $L_0 = X_0 \times Y_0$. All
6-vectors $L \neq 0$ fulfilling the Plücker constraint represent a 3D line.

All links between two geometric elements are bilinear in their homogeneous
coordinates, an example being the line joining two points in 3D. Thus the
coordinates of new entities can be written in the form

$$\gamma = A(\alpha)\,\beta = B(\beta)\,\alpha, \qquad \frac{\partial \gamma}{\partial \beta} = A(\alpha), \qquad \frac{\partial \gamma}{\partial \alpha} = B(\beta).$$

Thus the matrices $A(\alpha)$ and $B(\beta)$ have entries that are linear in the
coordinates $\alpha$ and $\beta$. At the same time they are the Jacobians of
$\gamma$.

We then may use error propagation, i.e. the propagation of covariances of linear
functions $y = A x$ of $x$ with covariance matrix $\Sigma_{xx}$, leading to
$\Sigma_{yy} = A \Sigma_{xx} A^T$, to obtain rigorous expressions for the
covariance matrices of constructed entities $\gamma$,

$$\Sigma_{\gamma\gamma} = A(\alpha)\, \Sigma_{\beta\beta}\, A^T(\alpha) + B(\beta)\, \Sigma_{\alpha\alpha}\, B^T(\beta)$$

in case of stochastic independence.
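A small numerical illustration of this propagation rule, using the join of two uncertain 2D points as the bilinear construction: here $l = x \times y = S(x)\,y = -S(y)\,x$, so the two Jacobians are $S(x)$ and $-S(y)$. The helper names `skew` and `join_with_covariance` and the chosen numbers are assumptions made for this sketch.

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix S(a) with S(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def join_with_covariance(x, Sxx, y, Syy):
    """Line l = x × y through two uncertain, independent 2D points."""
    l = skew(x) @ y                      # equals -skew(y) @ x
    # Sum of the two propagated contributions (stochastic independence).
    Sll = skew(x) @ Syy @ skew(x).T + skew(y) @ Sxx @ skew(y).T
    return l, Sll

# Hypothetical example: points (0, 0) and (1, 0), standard deviation 0.1
# in both Euclidean coordinates, homogeneous part fixed to 1.
x = np.array([0.0, 0.0, 1.0])
y = np.array([1.0, 0.0, 1.0])
Sigma = np.diag([0.01, 0.01, 0.0])
l, Sll = join_with_covariance(x, Sigma, y, Sigma)
```

The resulting line is (0, 1, 0), i.e. the axis v = 0, and its 3 x 3 covariance matrix follows directly from the two Jacobians without any linearization by hand.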

We need the line $L$ as intersection of two planes $A$ and $B$,

$$L = A \cap B = \Pi(A)\, B = -\Pi(B)\, A,$$

inducing the matrix $\Pi(A)$ depending on the entries of the plane $A$. We also
need the incidence constraints for points $x$ and lines $l$ in 2D, for points $X$
and planes $A$ in 3D and for two lines $L$ and $M$ in 3D:

$$x^T l = 0, \qquad X^T A = 0, \qquad \bar{L}^T M = 0. \qquad (1)$$



2.2 Projection of Points and Lines in Homogeneous Coordinates 

Point projection. The relation between a 3D point in object space and its image
point in the image plane can be written as

$$x' = P X \quad \text{with} \quad P_{3 \times 4} = \begin{pmatrix} A^T \\ B^T \\ C^T \end{pmatrix} \qquad (2)$$
where $X$ is the homogeneous coordinate vector of a point in object space, $x'$
the homogeneous coordinate vector representing its image point and $P$ the
3 x 4 matrix for point projection. Due to the homogeneity of $P$ it only contains
11 independent elements. Its rows $A^T$, $B^T$ and $C^T$ can be interpreted as
homogeneous coordinates of the three coordinate planes $x'_1 = 0$, $x'_2 = 0$ and
$x'_3 = 0$ intersecting in the projection center $X_c$ of the camera.
Analogously, the columns of $P$ can be interpreted as the images of the three
points at infinity of the three coordinate axes and of the origin.



Line projection. For line projection, it holds that

$$l' = Q \bar{L} \quad \text{with} \quad Q_{3 \times 6} = \begin{pmatrix} (B \cap C)^T \\ (C \cap A)^T \\ (A \cap B)^T \end{pmatrix} = \begin{pmatrix} (\Pi(B)\,C)^T \\ (\Pi(C)\,A)^T \\ (\Pi(A)\,B)^T \end{pmatrix} = \begin{pmatrix} (-\Pi(C)\,B)^T \\ (-\Pi(A)\,C)^T \\ (-\Pi(B)\,A)^T \end{pmatrix} \qquad (3)$$

where the 6 x 1 vector $\bar{L}$ contains the Plücker coordinates of the straight
line dual to the line $L$ and $l'$ denotes the coordinates of the image of $L$.
The rows of the 3 x 6 projection matrix $Q$ for line projection represent the
intersections of the planes mentioned above. Therefore they can be interpreted as
the Plücker coordinates of the three coordinate axes $x'_2 = x'_3 = 0$,
$x'_3 = x'_1 = 0$ and $x'_1 = x'_2 = 0$.
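The construction of $Q$ from the rows of $P$ can be checked numerically. The sketch below assumes one particular sign convention for the join of two points, $L_h = X_h Y_0 - Y_h X_0$, $L_0 = X_0 \times Y_0$, and the matching convention for the intersection of two planes; with these assumptions the line image $Q\bar{L}$ equals the cross product of the two projected points exactly, while other sign conventions only change signs. The function names are hypothetical.

```python
import numpy as np

def join(X, Y):
    """Plücker line (L_h, L_0) through the homogeneous 3D points X and Y."""
    return np.concatenate([X[3] * Y[:3] - Y[3] * X[:3],
                           np.cross(X[:3], Y[:3])])

def meet(A, B):
    """Plücker line (L_h, L_0) in which the planes A and B intersect."""
    return np.concatenate([np.cross(A[:3], B[:3]),
                           A[3] * B[:3] - B[3] * A[:3]])

def dual(L):
    """Dual line: homogeneous and non-homogeneous part interchanged."""
    return np.concatenate([L[3:], L[:3]])

def line_projection_matrix(P):
    """3 x 6 matrix Q whose rows are B∩C, C∩A and A∩B (rows of P)."""
    A, B, C = P
    return np.vstack([meet(B, C), meet(C, A), meet(A, B)])

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 4))               # arbitrary projection matrix
X, Y = rng.normal(size=4), rng.normal(size=4)
Q = line_projection_matrix(P)
l_from_Q = Q @ dual(join(X, Y))           # project the 3D line directly
l_from_points = np.cross(P @ X, P @ Y)    # join the two projected points
```

Both routes give the same image line, which is a convenient unit test for any implementation of the line projection.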

Inversion. Inversion of the projection leads to projection rays $L'$ for image
points $x'$ and to projection planes $A'$ for image lines $l'$:

$$L' = Q^T x', \qquad A' = P^T l'. \qquad (4)$$

The expression for $L'$ results from the incidence relation $x'^T l' = 0$
(cf. (1)) for all lines $l' = Q \bar{L}$ passing through $x'$, leading to
$(x'^T Q)\, \bar{L} = L'^T \bar{L} = 0$ using (3). The expression for $A'$ results
from the incidence relation $l'^T x' = 0$ for all points $x' = P X$ on the line
$l'$, leading to $(l'^T P)\, X = A'^T X = 0$ using (2). In particular, each point
$X$ on the line $L$ lies in the projecting plane $A'$, therefore it holds that

$$l'^T P X = 0. \qquad (5)$$



2.3 Determination of Q from P 

For a given matrix $P$ together with the covariance matrix $\Sigma_{\beta\beta}$
of its elements $\beta = (A^T, B^T, C^T)^T$ we can easily derive the matrix
$Q = (U, V, W)^T$ using (3). By error propagation we see that the covariance
matrix of $\gamma = (U^T, V^T, W^T)^T$ is given by

$$\Sigma_{\gamma\gamma} = M\, \Sigma_{\beta\beta}\, M^T$$

with the 18 x 12 matrix

$$M = \begin{pmatrix} 0 & -\Pi(C) & \Pi(B) \\ \Pi(C) & 0 & -\Pi(A) \\ -\Pi(B) & \Pi(A) & 0 \end{pmatrix}.$$
In a similar manner we may now determine the uncertainty of the projecting
lines and planes in (4).

3 Estimation of the Projection Matrix P 



There are several possibilities to determine the projection matrix P for points 
from observations of points or straight lines in an image. 

For the case that only points are observed, eq. (2) allows us to determine $P$
using the well-known DLT algorithm (cf. the cited reference, p. 71). If no points
but only straight lines are observed, we might use a similar algorithm based on
eq. (3) to determine the matrix $Q$ for lines. Then we need to derive $P$ from
$Q$, which is possible but of course not a very elegant method for determining
$P$. Integrating point and line observations would require a separate
determination of $P$ independently estimated from observed points and lines and a
fusion of the two projection matrices, which, moreover, would require a minimum
of 6 points and 6 lines.

We therefore follow the proposal in the literature and use (2) for points and (5)
for lines, as they both are linear in the unknown parameters of the projection
matrix $P$. It only requires at least six points and lines, however, in an
arbitrary mixture. We here give an explicit derivation of the estimation process
which not only gives an algebraic solution but a statistically optimal one.



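Observations of points. For each observed point
$x'_i = (u'_i, v'_i, w'_i)^T$, $i = 1, \ldots, I$, it holds (cf. (2)) that

$$\frac{u'_i}{w'_i} = \frac{A^T X_i}{C^T X_i} \quad \text{and} \quad \frac{v'_i}{w'_i} = \frac{B^T X_i}{C^T X_i},$$

which leads to the two constraints
$u'_i\, C^T X_i - w'_i\, A^T X_i = 0$ and $v'_i\, C^T X_i - w'_i\, B^T X_i = 0$,
which are dedicated to estimating the elements of $P$. In matrix representation,
these two constraints can be formulated as bilinear forms

$$A_i(x'_i)\, \beta = \begin{pmatrix} -w'_i X_i^T & 0^T & u'_i X_i^T \\ 0^T & -w'_i X_i^T & v'_i X_i^T \end{pmatrix} \beta = e_i \qquad (6)$$

resp.

$$B_i(\beta)\, x'_i = \begin{pmatrix} C^T X_i & 0 & -A^T X_i \\ 0 & C^T X_i & -B^T X_i \end{pmatrix} x'_i = e_i \qquad (7)$$

where the vector $\beta^T = (A^T, B^T, C^T)$ contains all unknown elements of $P$.
Eq. (6) will be used to estimate $\beta$, (7) will be used for determining the
uncertainty of the residuals by error propagation.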

As each observed point $x'_i$ yields 2 constraints to determine the 11 unknown
elements of $\beta$, $I \geq 6$ observed points would be needed if only points are
observed in the image.
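For the points-only case, the algebraic (DLT-type) solution can be sketched as follows: the 2 x 12 blocks of eq. (6) are stacked and $\beta$ is taken as the right singular vector belonging to the smallest singular value, which enforces $|\beta| = 1$. This is only the unweighted algebraic step under the assumption of noise-free toy data; the function names `point_rows` and `dlt_points` are hypothetical.

```python
import numpy as np

def point_rows(x, X):
    """2 x 12 matrix A_i(x') of the point constraints, x = (u, v, w)."""
    u, v, w = x
    z = np.zeros(4)
    return np.array([np.concatenate([-w * X, z, u * X]),
                     np.concatenate([z, -w * X, v * X])])

def dlt_points(xs, Xs):
    """Algebraic estimate of P from point correspondences."""
    A = np.vstack([point_rows(x, X) for x, X in zip(xs, Xs)])
    beta = np.linalg.svd(A)[2][-1]        # unit vector, |beta| = 1
    return beta.reshape(3, 4)

# Hypothetical noise-free test data: six random homogeneous 3D points.
rng = np.random.default_rng(0)
P_true = rng.normal(size=(3, 4))
Xs = rng.normal(size=(6, 4))
xs = Xs @ P_true.T                         # image points x' = P X
P_est = dlt_points(xs, Xs)
# Cosine between the estimated and true parameter vectors (up to sign).
cosine = abs(np.vdot(P_est, P_true)) / np.linalg.norm(P_true)
```

Since $P$ is homogeneous, the estimate agrees with the true matrix only up to scale and sign, which is why the comparison uses the cosine between the two parameter vectors.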
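Observations of straight lines. For each observed straight line
$l'_j = (a'_j, b'_j, c'_j)^T$, $j = 1, \ldots, J$, eq. (5) is valid.

Thus, if $X_{A,j}$ and $X_{E,j}$ are two points lying on the 3D line with image
$l'_j$, one obtains the two constraints

$$l'^T_j P X_{A,j} = 0 \quad \text{and} \quad l'^T_j P X_{E,j} = 0$$

resp.

$$a'_j X_{A,j}^T A + b'_j X_{A,j}^T B + c'_j X_{A,j}^T C = 0 \quad \text{and} \quad a'_j X_{E,j}^T A + b'_j X_{E,j}^T B + c'_j X_{E,j}^T C = 0,$$

or in matrix representation as bilinear constraints

$$C_j(l'_j)\, \beta = \begin{pmatrix} a'_j X_{A,j}^T & b'_j X_{A,j}^T & c'_j X_{A,j}^T \\ a'_j X_{E,j}^T & b'_j X_{E,j}^T & c'_j X_{E,j}^T \end{pmatrix} \beta = e_j \qquad (8)$$

resp.

$$D_j(\beta)\, l'_j = \begin{pmatrix} X_{A,j}^T A & X_{A,j}^T B & X_{A,j}^T C \\ X_{E,j}^T A & X_{E,j}^T B & X_{E,j}^T C \end{pmatrix} l'_j = e_j. \qquad (9)$$

Again, (8) will be used to determine $\beta$, while (9) will be used for
determining the uncertainty of the residuals by error propagation.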

We see that each observation of a straight line in an image yields 2 constraints
to determine $P$. Thus again, if only straight lines are observed, $J \geq 6$
lines are needed.

Note that in the following entities concerning observations of points are indexed 
with i and entities concerning observations of straight lines are indexed with j. 
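Because the point constraints (6) and the line constraints (8) are both linear in $\beta$, they can simply be stacked. The following sketch performs only the unweighted algebraic step on noise-free toy data with three points and three lines; it does not include the statistical weighting of the optimal estimate, and the function names are hypothetical.

```python
import numpy as np

def point_rows(x, X):
    """2 x 12 block of the point constraints, x = (u, v, w)."""
    u, v, w = x
    z = np.zeros(4)
    return np.array([np.concatenate([-w * X, z, u * X]),
                     np.concatenate([z, -w * X, v * X])])

def line_rows(l, XA, XE):
    """2 x 12 block of the line constraints, l = (a, b, c)."""
    a, b, c = l
    return np.array([np.concatenate([a * XA, b * XA, c * XA]),
                     np.concatenate([a * XE, b * XE, c * XE])])

def estimate_P(points, lines):
    """Algebraic estimate of P from an arbitrary mixture of observations."""
    rows = [point_rows(x, X) for x, X in points]
    rows += [line_rows(l, XA, XE) for l, XA, XE in lines]
    beta = np.linalg.svd(np.vstack(rows))[2][-1]
    return beta.reshape(3, 4)

# Hypothetical noise-free data: 3 observed points and 3 observed lines.
rng = np.random.default_rng(1)
P_true = rng.normal(size=(3, 4))
points, lines = [], []
for _ in range(3):
    X = rng.normal(size=4)
    points.append((P_true @ X, X))
    XA, XE = rng.normal(size=4), rng.normal(size=4)
    lines.append((np.cross(P_true @ XA, P_true @ XE), XA, XE))
P_mixed = estimate_P(points, lines)
cosine_mixed = abs(np.vdot(P_mixed, P_true)) / np.linalg.norm(P_true)
```

Three points and three lines already give 12 constraints for the 11 independent parameters, which illustrates the "six in an arbitrary mixture" requirement.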



Parameter estimation. Now we seek an optimal estimate for $\beta$. Due to the
homogeneity of $P$ we search for an optimal estimate just under the constraint
$|\beta| = 1$. Thus, we optimize

$$\Omega = \sum_i e_i^T(\beta)\, \Sigma_{e_i e_i}^{-1}(\beta)\, e_i(\beta) + \sum_j e_j^T(\beta)\, \Sigma_{e_j e_j}^{-1}(\beta)\, e_j(\beta)$$

under the constraint $|\beta| = 1$. In case the observations are normally
distributed this is the ML-estimate (under this model).

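The solution can be achieved by iteratively solving

$$\Big[ \sum_i A_i^T(\hat{x}'^{(\nu)}_i) \big( B_i(\beta^{(\nu)})\, \Sigma_{x'_i x'_i}\, B_i^T(\beta^{(\nu)}) \big)^{-1} A_i(x'_i) + \sum_j C_j^T(\hat{l}'^{(\nu)}_j) \big( D_j(\beta^{(\nu)})\, \Sigma_{l'_j l'_j}\, D_j^T(\beta^{(\nu)}) \big)^{-1} C_j(l'_j) \Big]\, \beta^{(\nu+1)} = 0, \qquad |\beta^{(\nu+1)}| = 1 \qquad (10)$$

Observe that this is the solution of an ordinary eigenvalue problem with a
non-symmetric matrix, as the left factor, e.g. $A_i^T(\hat{x}'^{(\nu)}_i)$, uses
the fitted observations $\hat{x}'^{(\nu)}_i$, whereas the right factor
$A_i(x'_i)$ uses the observed values.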
The covariance matrices of $e_i$ and $e_j$ are given by
$B_i(\beta^{(\nu)})\, \Sigma_{x'_i x'_i}\, B_i^T(\beta^{(\nu)})$ and
$D_j(\beta^{(\nu)})\, \Sigma_{l'_j l'_j}\, D_j^T(\beta^{(\nu)})$, so that they
have to use the previous estimate $\beta^{(\nu)}$.

The procedure is initiated with $\Sigma_{e_i e_i} = \Sigma_{e_j e_j} = I$ and
using the observed values $x'_i$ and $l'_j$ as initial values
$\hat{x}'^{(0)}_i$ and $\hat{l}'^{(0)}_j$ for the fitted observations. This leads
in the first step to the classical algebraic solution (cf. the cited reference,
p. 332, p. 377), and no approximate values are needed.

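The estimated observations $\hat{x}'^{(\nu)}_i$ and $\hat{l}'^{(\nu)}_j$ needed in
(10) can be calculated by

$$\hat{x}'^{(\nu)}_i = \Big( I - \Sigma_{x'_i x'_i} B_i^T(\beta^{(\nu)}) \big( B_i(\beta^{(\nu)})\, \Sigma_{x'_i x'_i}\, B_i^T(\beta^{(\nu)}) \big)^{-1} B_i(\beta^{(\nu)}) \Big)\, x'_i \qquad (11)$$

$$\hat{l}'^{(\nu)}_j = \Big( I - \Sigma_{l'_j l'_j} D_j^T(\beta^{(\nu)}) \big( D_j(\beta^{(\nu)})\, \Sigma_{l'_j l'_j}\, D_j^T(\beta^{(\nu)}) \big)^{-1} D_j(\beta^{(\nu)}) \Big)\, l'_j \qquad (12)$$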

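With the final weight matrix
$\Sigma_{ee}^{+} = \mathrm{Diag}(\mathrm{Diag}(\Sigma^{+}_{e_i e_i}), \mathrm{Diag}(\Sigma^{+}_{e_j e_j}))$
of the contradictions $e_i$ and $e_j$ and the covariance matrix

$$\Sigma_{ll} = \mathrm{Diag}(\mathrm{Diag}(\Sigma_{x'_i x'_i}), \mathrm{Diag}(\Sigma_{l'_j l'_j}))$$

of the observations, the covariance matrix of the estimated values is given by
$\Sigma_{\hat\beta\hat\beta} = M^{+}$ with

$$M = \sum_i A_i^T(\hat{x}'_i) \big( B_i(\hat\beta)\, \Sigma_{x'_i x'_i}\, B_i^T(\hat\beta) \big)^{-1} A_i(x'_i) + \sum_j C_j^T(\hat{l}'_j) \big( D_j(\hat\beta)\, \Sigma_{l'_j l'_j}\, D_j^T(\hat\beta) \big)^{-1} C_j(l'_j).$$

The pseudo inverse $M^{+}$ can be determined exploiting the fact that
$\hat\beta$ should be the null-space of $M^{+}$. We obtain an estimate for the
unknown variance factor $\hat\sigma_0^2 = \Omega / R$ with
$R = \mathrm{rk}(\Sigma_{ee}) - 11$, where $R$ is the redundancy and $\Omega$ the
weighted sum of squared residuals of the constraints. The redundancy results
from the fact that we effectively have $\mathrm{rk}(\Sigma_{ee})$ constraints
which have to determine 11 unknown independent parameters of $P$. Therefore the
estimated covariance matrix of the estimated parameters is given by

$$\hat\Sigma_{\hat\beta\hat\beta} = \hat\sigma_0^2\, M^{+}.$$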



4 Examples 

4.1 Synthetic Data 

The presented method was applied to a synthetic image, only using straight
lines as observations. The straight lines used connect all point pairs of a synthetic
cube. The cube has size 2 [m] and is centered at the origin, the projection center
is $(10, -3, 4)^T$ [m] and therefore has a distance of approx. 11.2 [m]. The principal
distance is 500 [pel], the assumed accuracy for the observed lines is 1 [pel], which
refers to the two end points. The cube appears with a diameter of approx. 200
[pel], thus the viewing angle is approx. 23° and the measuring accuracy is approx.
1 : 200. Fig. 1 shows the image of the ideal cube and the image of the
erroneous lines. The estimated projection matrix, determined from 28 observed
lines, differs from the ideal one by 0.6 %, an accuracy which is to be expected.

Fig. 1. Synthetic test data. (a) Ideal observations of lines in an image of a
synthetic cube. (b) Erroneous observations, assuming an accuracy of 1 [pel]
referring to the end points.
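The quoted figures of the synthetic setup follow from elementary geometry and can be re-derived in a few lines. The variable names are chosen for this sketch only; the viewing angle is computed from the stated image diameter and principal distance.

```python
import math

# Distance of the projection center (10, -3, 4) [m] from the origin.
center = (10.0, -3.0, 4.0)
distance = math.sqrt(sum(c * c for c in center))

# Viewing angle subtended by an image diameter of 200 pel at a
# principal distance of 500 pel.
principal_distance = 500.0
diameter_pel = 200.0
viewing_angle = 2.0 * math.degrees(math.atan((diameter_pel / 2.0) / principal_distance))

# Number of lines connecting all pairs of the 8 cube corners.
n_lines = math.comb(8, 2)
```

The values reproduce the approx. 11.2 m distance, the approx. 23° viewing angle and the 28 observed lines stated in the text.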



4.2 Real Data 



The presented method to determine $P$ was also applied to real data, again using
only observations of straight lines. The test object has the shape of an
asymmetric T with a height of 12 cm and a width of 14.5 cm (cf. Fig. 2(a)). It
is not very precise, so that object edges are not very sharp. Object information
was gained manually by measuring the lengths of the object edges, supposing
that the object edges are orthogonal. An image of the test object was taken (cf.
Fig. 2(a)) using a digital camera SONY XC77 CE with a CCD chip of size 750
x 560 pixel. From the image straight lines were extracted automatically using
the feature extraction software FEX, leading to the results shown in Fig. 2(b).




Fig. 2. Real test data, (a) Image of the test object, (b) Results of the feature extraction 
with FEX. 
With revised data (short and spurious lines were thrown out) the parameter
estimation delivered the estimate

$$P = \begin{pmatrix} 0.00295953 & -0.00320425 & -0.00106391 & 0.564433 \\ 0.00397216 & 0.00257314 & 0.0003446 & 0.743515 \\ 0.00000259 & 0.00000141 & -0.0000050 & 0.00246196 \end{pmatrix}$$


with the relative accuracies $\sigma_{P_{ij}} / P_{ij}$ given in the following table:

 (ij) |    1    |    2    |    3    |    4
 -----+---------+---------+---------+---------
   1  | 0.883 % | 0.598 % | 1.711 % | 0.116 %
   2  | 0.964 % | 1.492 % | 9.504 % | 0.068 %
   3  | 3.013 % | 8.218 % | 1.393 % | 0.049 %



The relative accuracies indicate that most elements of P are determined with 
good precision, particularly in view of the fact that the object information we 
used is not very precise and that the observed object edges are not very sharp. 
Obviously, the relative accuracy of some of the elements of P is inferior to the 
others. This is caused by the special geometry of the view in this example and 
not by the estimation method applied. 



5 Conclusion 

The examples demonstrate the feasibility of the presented method to estimate 
the projection matrix P for points. The method is practical, as it needs no 
approximate values for the unknown parameters and statistically optimal, as it 
leads to the maximum likelihood estimate. Therefore we think that it could be 
frequently used in computer vision and photogrammetry. 
Abstract. We present a method for estimating unknown geometric entities
based on identical, incident, parallel or orthogonal observed entities.

These entities can be points and lines in 2D and points, lines and planes 
in 3D. We don’t need any approximate values for the unknowns. The 
entities are represented as homogeneous vectors or matrices, which leads 
to an easy formulation for a linear estimation model. Applications of the 
estimation method are manifold, ranging from 2D corner detection to 
3D grouping. 

1 Introduction 

Constructing geometric entities such as points, lines and planes from given en- 
tities is a frequent task in Computer Vision. For example, in an image we may 
want to construct a junction point from a set of intersecting line segments (cor- 
ner detection) or construct a line from a set of collinear points and line segments 
(grouping). In 3D, a space point can be reconstructed by a set of image points 
(forward intersection); as a last example, one may want to construct a polyhedral 
line given incident, parallel or orthogonal lines and planes (3D- grouping). 

These geometric constructions can be described as an estimation task, where 
an unknown entity has to be fitted to a set of given observations in the least- 
squares sense. Unfortunately the underlying equations are nonlinear and quite 
difficult to handle in the Euclidean case. Because of the non-linearity, one needs 
approximate values for the unknowns, which are not always obvious to obtain. 

This article presents a general model for estimating points, lines and planes 
without knowing approximate values. We propose to use algebraic projective 
geometry in 2D and 3D together with standard estimation methods. We first 
explore possible representations for geometric entities, obtaining a vector and a 
matrix for each entity. Then we express relations between the entities by simple, 
bilinear functions using the derived vectors and matrices. These relations can be 
directly used in a parameter estimation procedure, where the initial value can be 
obtained easily. We show possible applications and some test results validating 
the performance of the proposed method. 

The proposed estimation model is based on (i) algebraic projective geometry,
which has been extensively promoted in Computer Vision in the last
decade, and (ii) integration of statistical information into the projective
entities. Kanatani presented a similar approach, though he does not make full
use of the elegant projective formulations, leading to complex expressions.
2 Representation of Geometric Entities

Assuming that both our observations and unknowns could be points, lines and 
planes in 2D and 3D, we first want to explore possible representations for these 
entities. 

2.1 Vector Representation 

Following the conventions of the cited literature, points and lines in 2D are
represented as homogeneous vectors x, y and l, m; in 3D we have X, Y for points,
L, M for lines and A, B for planes. Euclidean vectors will be denoted with italic
bold letters like X for a Euclidean 2D point, cf. the first column in Table 1.
Furthermore we will use the canonical basis vectors
$e_i = (0, \ldots, 1, \ldots, 0)^T$, with the 1 at the $i$-th position, for points,
lines or planes depending on the context.

For the geometric entities, each homogeneous vector contains a Euclidean
part, indexed with a zero, and a homogeneous part, indexed with an h. This
is chosen such that the distance of an element to the origin can be expressed
as the norm of the Euclidean part divided by the norm of the homogeneous
part, e.g. the distance of a plane to the origin is given by $|A_0| / |A_h|$.
The uncertainty of an entity can be described by the covariance matrix of this
vector. Note that a line $L^T = (L_h^T, L_0^T)$ is a 6-vector which has to fulfill
the Plücker condition $L^T \bar{L} = 0$ with the dual line
$\bar{L}^T = (L_0^T, L_h^T)$.




Fig. 1. (a) A line l and the three common points $x_i(l)$ with the three lines
$e_i$. (b) A point x and the three lines $l_i(x)$ incident to x and the points
$e_i$. (c) A line L and the four common points $X_i(L)$ with the four planes
$E_i$. (d) A point X and the four lines $L_i(X)$ incident to X and the points
$E_i$.

2.2 Matrix Representation 

An entity is represented by a vector using its coordinates, but one can also use
another representation: for example, a 2D line can be represented by its three
homogeneous coordinates (a, b, c), or implicitly by two points $x_1(l)$ and
$x_2(l)$ on the x- resp. y-axis, see Fig. 1(a). Additionally, one can choose the
infinite point $x_3(l)$, which lies in the direction of the line and is undefined,
thus $(0, 0, 0)^T$, if the line is at infinity. These three points can be
constructed by $x_{1 \ldots 3}(l) = l \cap e_{1 \ldots 3}$.
Writing $x_{1 \ldots 3}(l)$ in a 3 x 3 matrix yields the skew-symmetric matrix
$S(l)$ as defined in the second row of Table 1.

The same argument can be used to derive a skew-symmetric matrix $S(x)$,
which contains three lines $l_{1 \ldots 3}(x)$ incident to the point x, see
Fig. 1(b). These
Table 1. Points, lines and planes in 2D, represented as homogeneous vectors
x, l and matrices S(x), S(l), resp. in 3D, represented as homogeneous vectors
X, L, A and matrices $\Pi^T(X)$, $\Gamma(L)$, $\overline{\Pi}^T(A)$. Each row of
a homogeneous matrix represents a vector of the other entity. All row entities
of a matrix in turn define the given entity, see text for details.

2D

  point   vector: $x^T = (x_0^T, x_h) = (u, v, w)$
          matrix: $S(x) = \begin{pmatrix} 0 & -w & v \\ w & 0 & -u \\ -v & u & 0 \end{pmatrix}$, with rows $l_1(x)^T, l_2(x)^T, l_3(x)^T$

  line    vector: $l^T = (l_h^T, l_0) = (a, b; c)$
          matrix: $S(l) = \begin{pmatrix} 0 & -c & b \\ c & 0 & -a \\ -b & a & 0 \end{pmatrix}$, with rows $x_1(l)^T, x_2(l)^T, x_3(l)^T$

3D

  point   vector: $X^T = (X_0^T, X_h) = (U, V, W; T)$
          matrix: $\Pi^T(X) = \begin{pmatrix} T & 0 & 0 & 0 & W & -V \\ 0 & T & 0 & -W & 0 & U \\ 0 & 0 & T & V & -U & 0 \\ -U & -V & -W & 0 & 0 & 0 \end{pmatrix}$

  line    vector: $L^T = (L_h^T, L_0^T) = (L_1, L_2, L_3; L_4, L_5, L_6)$
          matrix: $\Gamma(L) = \begin{pmatrix} 0 & L_3 & -L_2 & -L_4 \\ -L_3 & 0 & L_1 & -L_5 \\ L_2 & -L_1 & 0 & -L_6 \\ L_4 & L_5 & L_6 & 0 \end{pmatrix}$
          dual:   $\overline{\Gamma}(L) = \Gamma(\bar{L}) = \begin{pmatrix} 0 & L_6 & -L_5 & -L_1 \\ -L_6 & 0 & L_4 & -L_2 \\ L_5 & -L_4 & 0 & -L_3 \\ L_1 & L_2 & L_3 & 0 \end{pmatrix}$

  plane   vector: $A^T = (A_h^T, A_0) = (A, B, C; D)$
          matrix: $\overline{\Pi}^T(A) = \begin{pmatrix} 0 & C & -B & D & 0 & 0 \\ -C & 0 & A & 0 & D & 0 \\ B & -A & 0 & 0 & 0 & D \\ -A & -B & -C & 0 & 0 & 0 \end{pmatrix}$


matrices have rank 2, because only two rows in the matrix S(*) are needed to 
construct the third one. 
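The 2D matrices can be tried out directly. The following sketch builds S(·) as given in table 1, checks the rank-2 property, and uses the matrix as a construction operator (join of two points, intersection of two lines). The helper name `S` and the sample coordinates are illustrative choices, not taken from the paper.

```python
import numpy as np

def S(v):
    """Skew-symmetric matrix of a homogeneous 2D point or line (u, v, w)."""
    a, b, c = v
    return np.array([[0.0, -c, b],
                     [c, 0.0, -a],
                     [-b, a, 0.0]])

x = np.array([1.0, 2.0, 1.0])   # point (1, 2)
y = np.array([3.0, 1.0, 1.0])   # point (3, 1)

l = S(x) @ y                    # join: line through x and y
m = np.array([1.0, 0.0, -1.0])  # the line x = 1
p = S(l) @ m                    # intersection of l and m
p_euclidean = p[:2] / p[2]

rank_Sx = np.linalg.matrix_rank(S(x))
```

Since S(x) y is linear in y for fixed x and vice versa, the same matrix serves as the Jacobian of the construction, which is what makes the representation convenient for error propagation.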

In a similar manner we can define matrices for points, lines and planes in
3D: for example, a point X can be defined by a set of four intersecting lines
$L_{1..4}(X)$, where three lines are each parallel to one axis and the fourth line
intersects the origin, see fig. 1(d). Since a line is represented by a 6-vector, we
obtain a 4 x 6 matrix $\Pi^T(X) = (L_1(X), \ldots, L_4(X))^T$ as a representation of a
point X. Note that the six columns of $\Pi^T(X)$ can be interpreted as six planes,
which are incident to the point X and to the six space lines $E_i$.

A space line can be determined by four points $X_{1..4}(L)$: three of them are
on the axis-planes $E_{1..3}$, the fourth one lies on the infinite plane $E_4$ in the direction
of the line, see fig. 1(c). These four points define the 4 x 4 skew-symmetric matrix
$\Gamma(L)$. Only two rows of the matrix $\Gamma(L)$ are linearly independent; the selection
of the rows depends on the position and direction of the line. As a space line can
also be represented by four planes $A_{1..4}(L)$, we can define a line matrix based
on these four planes, yielding $\bar{\Gamma}(L)$, cf. table 1. In the literature, $\Gamma(L)$ and $\bar{\Gamma}(L)$
are referred to as Plücker matrices, e.g. in [?].

The matrix of a plane can be defined dually to the matrix of a point by
$\bar{\Pi}^T(A)$, as in table 1, right column, last row. Note that S(x), S(l), $\Pi(X)$, $\Gamma(L)$,
$\bar{\Pi}(A)$ are not only representations of points, lines and planes, but also the
Jacobians for constructing elements by join $\wedge$ and intersection $\cap$, see [?].

3 Relations between Unknown and Observed Entities

When estimating an unknown entity, all the observations are related to the
unknown in some way. In this work we assume the relations to be either incidence,
equality, parallelity or orthogonality; for example, an unknown space line M
may have a list of incident points $X_i$, incident lines $L_i$ and parallel planes $A_i$
as observed entities. In this section we want to derive algebraic expressions for
a relation between two entities.

First we want to focus on incidence and equality relations in 3D: the simplest
incidence relation is a point-plane incidence, since it is only the dot product of
the two vectors: $X^T A = 0$. Furthermore, the incidence relation of two lines can be
expressed by a dot product: $\bar{M}^T L = 0$. From $X^T A$ and $\bar{M}^T L$ we can derive all
other possible incidence and equality relations: for example, if two lines L and
M are equal, each of the four points $X_{1..4}(M)$ must be incident to each of the
four planes $A_{1..4}(L)$. Therefore the matrix product $\bar{\Gamma}(L)\,\Gamma(M)$ must be
zero. Another example is a line L which is incident to a plane A if the four lines
$L_{1..4}(A)$ are incident to the line L. All possible incidence and equality relations
between unknowns and observations are listed in table 3.
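The incidence and equality tests of this section can be checked numerically. The sketch below assumes the block forms $\Gamma(L) = [-S(L_0), -L_h; L_h^T, 0]$, $\bar{\Gamma}(L) = \Gamma(\bar{L})$ and $\bar{\Pi}^T(A) = [-S(A_h), A_0 I; 0^T, -A_h^T]$ together with the partition $L = (L_h, L_0)$; the function names and the sample entities are hypothetical.

```python
import numpy as np

def S(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def dual_line(L):
    return np.concatenate([L[3:], L[:3]])

def Gamma(L):
    """Pluecker matrix whose rows are points on L."""
    Lh, L0 = L[:3], L[3:]
    G = np.zeros((4, 4))
    G[:3, :3] = -S(L0)
    G[:3, 3] = -Lh
    G[3, :3] = Lh
    return G

def Gamma_bar(L):
    """Dual Pluecker matrix whose rows are planes through L."""
    return Gamma(dual_line(L))

def Pi_bar_T(A):
    """4 x 6 matrix whose rows are four lines lying in the plane A."""
    Ah, A0 = A[:3], A[3]
    P = np.zeros((4, 6))
    P[:3, :3] = -S(Ah)
    P[:3, 3:] = A0 * np.eye(3)
    P[3, 3:] = -Ah
    return P

L = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0])  # line through (1,0,1), direction y
M = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # line through origin and (1,0,1)
N = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # the x-axis, skew to L
X = np.array([1.0, 0.0, 1.0, 1.0])             # point on L
A = np.array([0.0, 0.0, 1.0, -1.0])            # plane z = 1, contains L

point_in_plane = X @ A                           # X^T A
lines_meet = dual_line(M) @ L                    # intersecting lines
lines_skew = dual_line(N) @ L                    # non-intersecting lines
point_on_line = Gamma_bar(L) @ X                 # four plane-point products
line_equal = Gamma_bar(L) @ Gamma(3.0 * L)       # same line, different scale
line_differs = Gamma_bar(L) @ Gamma(M)
line_in_plane = Pi_bar_T(A) @ dual_line(L)       # four line-line products
```

Every constraint is a product of a matrix built from one entity with the vector (or matrix) of the other, which is the bilinearity the estimation relies on.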

Table 2. Possible incidence, equality, parallelity and orthogonality relations between
lists of observed points $x_i$ and lines $l_i$ and unknown points y and lines m.

Table 3. Possible incidence and equality relations $\in$, $=$ between lists of observed space
points $X_i$, lines $L_i$, and planes $A_i$ and unknown space points $Y_i$, lines $M_i$ and planes
$B_i$ and their corresponding algebraic constraints.

Parallelity and orthogonality relations refer to the homogeneous parts of the
vectors: in 3D we have to test whether the (Euclidean) direction $L_h$ of the line L
is perpendicular resp. parallel to the plane normal $A_h$. In general, these relations
can be expressed by the dot product or by the cross product of the two vectors,
where the cross product can be written using the skew-symmetric matrix S; the
relations are listed in table 4.

Two things are important to note: (i) all algebraic constraints for the relations
in 2D and 3D (for 2D, see table 2) are bilinear with respect to each
entity, cf. [?]. (ii) We do not have to compute every dot product relevant for a
specific relation. It is sufficient to select as many as the relation has degrees of
freedom: for example, the incidence of a point X and a line L has two degrees
of freedom, therefore we can select two rows $A_j(L)$, $A_k(L)$ out of $\bar{\Gamma}(L)$ and compute
the dot products $A_j(L)^T X = 0$ and $A_k(L)^T X = 0$. The selection depends on the
coordinates $L_j$ of the line L; it is numerically safe to take the two rows j, k with the
largest $|\bar{\Gamma}_{jk}|$.
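One way to realize the row selection for the point-line incidence is sketched below. It picks the two rows j, k that share the matrix entry of largest magnitude; because the matrix is skew-symmetric, these two rows contain a nonsingular 2 x 2 block and are therefore linearly independent. This reading of the selection rule, and the names `Gamma_bar` and `select_two_rows`, are assumptions.

```python
import numpy as np

def S(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def Gamma_bar(L):
    """Dual Pluecker matrix [[-S(L_h), -L_0], [L_0^T, 0]]; rows are planes through L."""
    Lh, L0 = L[:3], L[3:]
    G = np.zeros((4, 4))
    G[:3, :3] = -S(Lh)
    G[:3, 3] = -L0
    G[3, :3] = L0
    return G

def select_two_rows(M):
    """Return rows j and k of a skew-symmetric M, where |M[j, k]| is maximal."""
    j, k = np.unravel_index(np.argmax(np.abs(M)), M.shape)
    return M[[j, k], :]

L = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0])   # line through (1,0,1), direction y
X_on = np.array([1.0, 0.5, 1.0, 1.0])            # lies on L
X_off = np.array([0.0, 0.0, 0.0, 1.0])           # the origin, not on L

rows = select_two_rows(Gamma_bar(L))
residual_on = rows @ X_on
residual_off = rows @ X_off
```

The two selected residuals carry the same information as all four rows, but yield a regular 2 x 2 covariance matrix when propagated.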

Table 4. Possible orthogonality and parallelity relations $\perp$, $\parallel$ between observed lists of
lines $L_i$ and planes $A_i$ and unknown lines M and planes B. The lower index h indicates
the use of the homogeneous part of an entity, cf. table 1.

4 Estimation of an Unknown Entity

We now have the relations between an unknown entity $\beta$ and a list of observed
entities $\gamma_i$ expressed in the implicit form $g_i(\beta; \gamma_i) = 0$.

Taking this equation, we can use the Gauss-Helmert model for estimating the
unknown $\beta$, cf. [?] or [?]. Since the Gauss-Helmert model is a linear estimation
model, we need the Jacobians $\partial g_i(\beta; \gamma_i)/\partial \gamma_i$ and $\partial g_i(\beta; \gamma_i)/\partial \beta$. We have already
given them in sec. 3, because all our expressions were bilinear: in
tables 2, 3 and 4 the Jacobians $\partial g_i(\beta; \gamma_i)/\partial \beta$ are given for each combination of
unknown $\beta$ and observation $\gamma_i$.

As the estimation is iterative, we need an initial value for the unknown $\beta$.
Since we do not need the observations, we can minimize the algebraic distance
$\Omega$ (cf. [?], p. 332 & 377):

$$\Omega = \sum_i g_i^T g_i = \beta^T \Big( \sum_i A_i^T A_i \Big) \beta \quad \text{with} \quad A_i = \frac{\partial g_i}{\partial \beta}.$$

This leads to an eigenvector problem for the matrix $\sum_i A_i^T A_i$: the
normalized eigenvector belonging to the smallest eigenvalue gives the approximate
value for the unknown $\beta$.
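A minimal sketch of this initialization, assuming each constraint is available as a Jacobian block $A_i$ with $g_i = A_i \beta$: the blocks are stacked, and the eigenvector of $\sum_i A_i^T A_i$ with the smallest eigenvalue is returned. The function name `algebraic_estimate` and the example (a 2D point incident to three lines, so that $A_i = l_i^T$) are illustrative.

```python
import numpy as np

def algebraic_estimate(jacobians):
    """Unit vector beta minimizing sum_i |A_i beta|^2."""
    A = np.vstack([np.atleast_2d(Ai) for Ai in jacobians])
    N = A.T @ A
    eigvals, eigvecs = np.linalg.eigh(N)   # eigenvalues in ascending order
    return eigvecs[:, 0]

# Three lines that all pass through the point (2, 3):
lines = [np.array([1.0, 0.0, -2.0]),   # x = 2
         np.array([0.0, 1.0, -3.0]),   # y = 3
         np.array([1.0, 1.0, -5.0])]   # x + y = 5

y = algebraic_estimate(lines)
y_euclidean = y[:2] / y[2]
```

The result ignores the covariance matrices of the observations, so it only serves as a starting value for the iterative, statistically weighted estimation.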

We have seen above that the Jacobians $A_i$ may not have full rank, which
leads to a singular covariance matrix $\Sigma_{gg}$ inside the Gauss-Helmert model. One
can either compute the pseudoinverse as in [?] or select rows of the Jacobians
such that the rank of the selected Jacobian equals the number of degrees of
freedom of the relation.

¹ For equality relations, one can also use a different Jacobian as proposed in [?].

5 Example Tasks

We have implemented the proposed estimation as a single generic algorithm
covering all 2D and 3D entities and all proposed constraints. The input of the
algorithm is the type of the unknown and an arbitrary set of observations. We
will denote observations with their corresponding constraint types $\in$, $=$, $\parallel$, $\perp$ as
upper index; e.g. if an unknown point y should lie on a set of observed lines,
we have $g(y; l_i^{\ni}) = 0 \Leftrightarrow l_i^{\ni} \ni y$, denoting the observed lines $l_i^{\ni}$ with upper index
$\ni$, since they are incident to the point y.

Using the same estimation algorithm, we can e.g. solve the following tasks
among many others (the third and the fourth task will be evaluated in sec. 6):

- Corner detection in 2D. We have an unknown point y and a set of observed
  lines $l_i^{\ni}$ to which the unknown is incident, $l_i^{\ni} \ni y$. The implicit functional
  model is $g(y; l_i^{\ni}) = 0$, the Jacobians are S(y) and $S(l_i^{\ni})$.

- Grouping in 2D. We have an unknown line m and the following lists of
  observations: (1) incident points $x_i^{\in}$, (2) collinear (i.e. equal) lines $l_i^{=}$, (3)
  parallel lines $l_i^{\parallel}$, (4) orthogonal lines $l_i^{\perp}$. The implicit functional model is
  $g(m; x_i^{\in}, l_i^{=}, l_i^{\parallel}, l_i^{\perp}) = 0$. The Jacobians $\partial g(m; \gamma_i)/\partial m$ are listed in table 2.

- Forward intersection. We want to construct a 3D point Y out of a set of
  points $x_j$ and lines $l_j$ in n images $j \in \{1 \ldots n\}$. Inverting the projections
  $P_j$, we can obtain 3D lines $L(x_j)$ and planes $A(l_j)$, see [?] and [?]. We
  can now estimate the unknown 3D point Y: the implicit functional model
  is $g(Y; L(x_j), A(l_j)) = 0$. The Jacobians used here are $\bar{\Gamma}(L(x_j))$, $\Pi^T(Y)$,
  $A(l_j)^T$ and $Y^T$. Note that we can also construct 3D lines from sets of 2D
  points and lines by forward intersection, see also sec. 6.

- Grouping in 3D. A 3D line representing one edge of a cube is unknown,
  the observed entities are: (1) incident points $X_i^{\in}$, (2) incident lines $L_i^{\in}$, (3)
  collinear (i.e. equal) lines $L_i^{=}$, (4) parallel lines $L_i^{\parallel}$, (5) orthogonal lines $L_i^{\perp}$,
  (6) incident planes $A_i^{\in}$, (7) parallel planes $A_i^{\parallel}$ and (8) orthogonal planes
  $A_i^{\perp}$. All observed entities can be derived from the other 9 edges, 8 corners and
  6 surfaces of the cube. We have $g(M; X_i^{\in}, L_i^{\in}, L_i^{=}, L_i^{\parallel}, L_i^{\perp}, A_i^{\in}, A_i^{\parallel}, A_i^{\perp}) = 0$
  and the Jacobians listed in tables 3 and 4. In the same manner we can also
  construct a point representing a cube corner or a plane representing a cube
  surface.
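The 2D grouping task shows how different constraint types end up in one linear system. The sketch below stacks one Jacobian block per observation and solves for the line m algebraically; the row forms are derived from the constraints named in the text (dot product for incidence and orthogonality, cross product of the homogeneous parts for parallelity, S(l) m = 0 for equality). It is only an unweighted approximation, not the Gauss-Helmert estimation itself, and all function names are made up for illustration.

```python
import numpy as np

def S(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rows_incident_point(x):
    return x[None, :]                        # x^T m = 0

def rows_equal_line(l):
    return S(l)                              # l x m = 0

def rows_parallel_line(l):
    return np.array([[-l[1], l[0], 0.0]])    # l_h x m_h = 0

def rows_orthogonal_line(l):
    return np.array([[l[0], l[1], 0.0]])     # l_h . m_h = 0

def estimate_line(blocks):
    A = np.vstack(blocks)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    return eigvecs[:, 0]                     # smallest eigenvalue

# Observations that are all compatible with the line y = x, m = (1, -1, 0):
blocks = [
    rows_incident_point(np.array([0.0, 0.0, 1.0])),
    rows_incident_point(np.array([2.0, 2.0, 1.0])),
    rows_equal_line(np.array([2.0, -2.0, 0.0])),
    rows_parallel_line(np.array([1.0, -1.0, 3.0])),    # y = x + 3
    rows_orthogonal_line(np.array([1.0, 1.0, -1.0])),  # x + y = 1
]
m = estimate_line(blocks)
```

Adding a further constraint type only requires another function that returns its Jacobian rows, which mirrors the generic design described above.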

6 Experimental Results 

Artificial Data. To validate our algorithm, we have tested it on the 3D grouping
task described in section 5, using artificial data and applying all
proposed constraints. Artificial data provides ground truth, which is necessary
for validating an estimation method. We estimated a 3D point, line and plane
being part of translated and rotated cubes of size 1. For each of the 8 corners, 12
edges and 6 surfaces of the cube, 5 observations have been generated randomly
with a Gaussian error of 0.1% and 1%. When taking into account all possible
constraints, the redundancy of the estimation is 57 for the unknown point,
176 for the unknown line and 162 for the unknown plane. We drew 1000 sample
cubes and then estimated 1000 points, lines and planes. In fig. 2 the first row
shows the cumulative histogram of the estimated $\hat{\sigma}_0$ for the points Y, lines M and
planes B; over all estimates we get an average of $\hat{\sigma}_0 = 0.988$. The second row
depicts the scatter diagram for the estimated point Y, projected onto the three
principal planes. Furthermore, the $\chi^2$-test on bias was not rejected.

For the given datasets, all estimations converged within 3 to 4 iterations.
The stopping criterion for the iteration was defined as follows: the corrections to
the values of $\beta$ should be less than 1% of their standard deviations.
The algorithm was implemented in the scripting language Perl and takes about
1 [sec] for 50 point resp. plane observations; it is slower by a factor of 1.5 for lines.

Real Data. We have tested the task of forward intersection of 3D lines using
corresponding points and lines from four overlapping aerial images. Interactively
we specified sets of 2D points and line segments which correspond to the same
3D edges. The 2D features were automatically extracted by a polymorphic
point-feature extraction method called FEX, cf. [?]. FEX also gives covariance matrices
for the extracted points and lines. The projection matrices were fixed, though
one can include statistical information here, too. For one scene we estimated 16
3D lines corresponding to object edges; on average we used 4 point and 4 line
observations for each 3D line. The endpoints of the 2D line segments defined the
endpoints of the 3D line segments. The lengths of the line segments were between
2 [m] and 12 [m], the average standard deviations at the line endpoints orthogonal
to the 3D line segments were between 0.02 [m] and 0.16 [m], cf. fig. 3.

Fig. 2. First row: cumulative histogram of the estimated $\hat{\sigma}_0$ for the estimated
point, line and plane, resp. Second row: scatter diagram of the estimated point Y
projected onto the xy-, xz- and yz-planes. The inner confidence ellipse is averaged over all
1000 estimated ellipses, the outer one is the empirical confidence ellipse based on all
estimated points and is larger by a factor of approx. 1.2.



7 Conclusions 



We proposed a generic estimation method for estimating points, lines and planes
from identical, incident, parallel or orthogonal points, lines and planes. It can be used
in a wide variety of applications and does not need any approximate values for
the unknowns. The implementation has been tested on a large artificial dataset
to validate its correctness. Future work includes, among other topics: (i) the
estimation of polyhedral structures like polygons on a plane or cuboid elements,
(ii) additional constraints like metric constraints, e.g. a fixed distance between
entities, (iii) embedding the estimation method in a robust estimation
scheme like RANSAC.

Fig. 3. From four aerial image patches (left) we manually matched 2D features of 16
roof edges (middle). The average cross error at the endpoints of the estimated 3D line
segments does not exceed 0.17 [m], the length l of the line segments varied between 2 and
14 [m], see text for details.
Abstract. Vision systems for service robotics applications have to cope with varying
environmental conditions, partial occlusions, complex backgrounds and a large
number of distractors (clutter) present in the scene. This paper presents a new
approach targeted at such application scenarios that combines segmentation, object
recognition, 3D localization and tracking in a seamlessly integrated fashion. The
unifying framework is the probabilistic representation of various aspects of the
scene. Experiments indicate that this approach is viable and gives very satisfactory
results.



1 Introduction 

Vision systems for service robotics applications have to cope with varying environmental 
conditions, partial occlusions, complex backgrounds and a large number of distractors 
(clutter) present in the scene. Systems mounted on mobile platforms additionally have 
to incorporate ego-motions and therefore have to solve the real-time tracking problem. 
Conventional vision systems generally perform the necessary segmentation and
recognition as well as possibly 3D-localization and tracking in a pipelined, sequential way,
with a few exceptions like [?], who integrate recognition and segmentation in a
closed-loop fashion.

However, one can imagine several ways in which different parts of the image
processing pipeline could profit from each other. Service robots, for example, will in most
cases be allowed to observe their environment for short time periods to take advantage of
the information present in image streams. Moving scenes will show the objects under
observation from changing view points and this can be exploited to maintain and improve 
existing object recognition and localization hypotheses, provided that those are tracked 
over time. In a similar way, object recognition can profit from successful segmentation 
and segmentation in turn can benefit from depth information, if available. These possible 
synergies are currently not exploited by most systems. 

This paper presents a new approach that combines segmentation, recognition, 3D- 
localization and tracking in a seamlessly integrated fashion. We aim at developing vision 
algorithms which will reliably work in real everyday environments. To reach this goal 
it is mandatory to take advantage of the synergies between the different stages of the 
conventional processing pipeline. Our approach is to eliminate the pipeline structure and
simultaneously solve the segmentation, object recognition, 3D localization and tracking
problems. The unifying framework is the probabilistic representation of various aspects
of the scene.



2 Method 

Our method, as currently implemented and described in section 2.3, takes advantage of
previous work in two major areas. First, object recognition methods based on
probabilistic models of the object's appearance have recently been presented by many research
groups (see [?] among many others) and have shown promising results with respect
to robustness against varying viewpoints, lighting and partial occlusions. In addition,
these models can be trained from demonstrated examples. Thus, the model acquisition
process does not necessarily require a skilled operator, which is a useful property for the
application of such a system in the service robotics field.

The second major contribution on which the current implementation of our
approach is built is the Condensation algorithm by Isard and Blake [?]. It is used for
the probabilistic representation and tracking of our object hypotheses (recognition and
localization) over time. A short overview of probabilistic object recognition and the
Condensation algorithm is given in sections 2.1 and 2.2. As mentioned before, our goal is to
integrate segmentation, object recognition, 3D localization and tracking in order to take
advantage of synergies among them. The probabilistic approach to object recognition is
the framework that enables us to do so.



2.1 Probabilistic Object Recognition 

Probabilistic approaches have received significant attention in various domains of
robotics, ranging from environmental map building [?] to mobile robot localization [?]
and active view planning [?]. This is mainly due to the absolute necessity for
explicit models of environmental uncertainty as a prerequisite for building successful
robot systems.

For a probabilistic recognition of an object o from an image measurement m, we are
interested in the conditional probability $p(o|m)$. This is called the posterior probability of
the object o given the measurement m. The optimal decision rule for deciding whether
the object is present is to decide based on which probability is larger, $p(o|m)$ or
$p(\bar{o}|m) = 1 - p(o|m)$, with $\bar{o}$ referring to the absence of the object. This decision rule is
optimal in the sense of minimizing the rate of classification errors [?].

It is practically not feasible to fully represent $p(o|m)$, but using Bayes' rule we
can calculate it according to

$$p(o|m) = \frac{p(m|o)\, p(o)}{p(m)} \qquad (1)$$

with

- $p(o)$ the a priori probability of the object o
- $p(m)$ the a priori probability of the measurement m
- $p(m|o)$ the conditional probability of the measurement m given the presence of the
  object o

These probabilities can be derived from measurement histograms computed from
training data. In case of a simple object detection problem, the decision rule
$p(o|m) > p(\bar{o}|m)$ can be rewritten as a likelihood test. Inserting (1) for both hypotheses,
we decide that the object is present if

$$\frac{p(m|o)}{p(m|\bar{o})} > \lambda \quad \text{with} \quad \lambda = \frac{p(\bar{o})}{p(o)}. \qquad (2)$$

Here $\lambda$ can be interpreted as a detection threshold, which in many cases will be set
arbitrarily, since the a priori probabilities of the objects depend upon the environmental
and application context and are difficult to obtain. Having k independent measurements
$m_k$ (e.g. from a region R belonging to the object) the decision rule becomes

$$\prod_k \frac{p(m_k|o)}{p(m_k|\bar{o})} > \lambda. \qquad (3)$$

If the $m_k$ are measurement results of local appearance characteristics like color or local
edge energy (texture), the resulting recognition systems tend to be comparatively robust
to changes in viewing conditions [?]. However, this is not the main subject of this
article.
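As an illustration of the likelihood test for several independent measurements, the following sketch compares the product of the per-measurement ratios $p(m_k|o)/p(m_k|\bar{o})$ with a threshold $\lambda$, evaluated in the log domain for numerical stability. The histograms `p_obj` and `p_bg`, the small constant `eps` that guards against empty histogram bins, and the function names are assumptions made for this example.

```python
import numpy as np

def log_likelihood_ratio(measurements, p_obj, p_bg, eps=1e-6):
    """Sum over k of log p(m_k | o) - log p(m_k | not o).

    measurements: integer histogram bin indices m_k
    p_obj, p_bg:  normalized histograms learned from training data
    """
    m = np.asarray(measurements)
    return float(np.sum(np.log(p_obj[m] + eps) - np.log(p_bg[m] + eps)))

def object_present(measurements, p_obj, p_bg, lam):
    """Decide for the object if the likelihood ratio exceeds lam."""
    return log_likelihood_ratio(measurements, p_obj, p_bg) > np.log(lam)

p_obj = np.array([0.7, 0.2, 0.1])   # object model, three feature bins
p_bg = np.array([0.1, 0.2, 0.7])    # background model

llr = log_likelihood_ratio([0, 0, 1], p_obj, p_bg)
decision_object = object_present([0, 0, 1], p_obj, p_bg, lam=10.0)
decision_clutter = object_present([2, 2, 1], p_obj, p_bg, lam=10.0)
```

Working with the sum of logarithms instead of the product avoids underflow when a region contributes hundreds of measurements.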

The appearance-based object recognition approach solves only one part of our
problem. Furthermore, it theoretically requires that the scene is properly segmented, since
equation (3) assumes that all measurements $m_k$ come from the same object. A proper
segmentation of complex cluttered scenes is known to be a difficult task. Depth information
would be extremely helpful. Object boundaries generally will be easier to detect in depth
images, but these are computationally expensive when computed over full frames. The
depth recovery could be done more efficiently if we already had a position hypothesis x
for the object to be recognized at the current time step t. Section 2.3 will show how all
these synergies can be exploited without the need for extensive stereo correspondence
search.

2.2 A Short Introduction to the Condensation Algorithm 

The goal of using the Condensation algorithm is to efficiently compute and track our
current belief $p(x, t)$ of the object position x, i.e. the probability that the object is at
x at time t. While one could represent this belief distribution in a regular grid over
the 3D space of interest as done by Moravec [?], it is evident that this requires huge
amounts of memory and is computationally expensive. Additionally, in the context of
object pose tracking most of the space is not occupied by the object(s) being tracked.
Therefore those grid cells will have uniformly low values. Closed-form (e.g. Gaussian)
unimodal representations of the belief distribution are generally not suitable in cluttered
scenes, since the belief will be multi-modal. Isard and Blake propose factored sampling
to represent such non-Gaussian densities. They developed the Condensation algorithm
for tracking belief distributions over time, based on a dynamic model of the process and
observations from image sequences. Details of their method can be found in [?].

For the context of this paper it is important that it is an iterative method in which the
density is represented as a weighted set $\{s^{(n)}\}$ of N samples from the state space of the
dynamic system (in our case from the 3D space of possible position hypotheses). The
samples are drawn according to the prior belief distribution $p(x)$ and assigned weights
$\pi^{(n)} = p(z|x = s^{(n)})$, with z being an observation. The $\pi^{(n)}$ are normalized to sum up
to 1. The subsequent distribution is predicted using a dynamic model $p(x_t|x_{t-1})$. The
weighted set $\{(s^{(n)}, \pi^{(n)})\}$ represents an approximation of the posterior density $p(x|z)$
which is arbitrarily close for $N \to \infty$. Accordingly, the observation density has to be
evaluated at $x = s^{(n)}$ only. Thus, only small portions of the images have to be processed.
The number of samples N determines the computational effort required in each time
step.
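The iteration described above can be sketched in a few lines. The version below resamples according to the current weights, predicts with a Gaussian random walk, and re-weights with an observation density. It is a simplified illustration, not Isard and Blake's implementation: the synthetic Gaussian observation density around a fixed `target`, the noise level `sigma`, and all names are assumptions.

```python
import numpy as np

def condensation_step(samples, weights, observation_density, sigma, rng):
    """One iteration: resample, predict with a random walk, re-weight."""
    n = len(samples)
    idx = rng.choice(n, size=n, p=weights)               # factored sampling
    predicted = samples[idx] + rng.normal(0.0, sigma, samples.shape)
    w = np.array([observation_density(s) for s in predicted])
    total = w.sum()
    w = w / total if total > 0 else np.full(n, 1.0 / n)  # normalize to sum 1
    return predicted, w

rng = np.random.default_rng(0)
target = np.array([1.0, 2.0, 3.0])                       # synthetic object position

def density(x):
    # Stand-in for p(z | x): peaks at the target position.
    return np.exp(-np.sum((x - target) ** 2) / (2 * 0.5 ** 2))

N = 1000
samples = rng.uniform(0.0, 5.0, size=(N, 3))             # uniform prior belief
weights = np.full(N, 1.0 / N)
for _ in range(20):
    samples, weights = condensation_step(samples, weights, density, 0.1, rng)

estimate = (weights[:, None] * samples).sum(axis=0)      # first moment
```

Only the N sample positions are evaluated in each step, which is where the savings over a full grid representation come from.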

2.3 Simultaneous Solution to the Segmentation, Object Recognition, 3D 
Localization, and Tracking Problems 

Our approach is to use Condensation for representing and tracking the belief $p(x)$ of the
object of interest being at x. The state space used in our implementation is the
three-dimensional vector x describing the object position in space. The current system does
not model object rotations. As can be seen from the short description of the Condensation
algorithm given in section 2.2, we have to model our object's dynamics $p(x_t|x_{t-1})$. The
"dynamic" model currently used is a simple three-dimensional random walk with a
Gaussian conditional density. Of course, this will be extended in the future. Having
specified the very simple dynamic model, we have to define the evaluation procedure of
the observation density $p(z|x)$, i.e. the probability for the observation z, given that the
object is at x.




Fig. 1. A simple stereo setup. 



We use a calibrated stereo setup. Therefore we are able to project each position
hypothesis x into both images (see figure 1). A hypothesis is probable if the local image
patches centered at the projected points in both images satisfy two constraints:

Consistency Constraint: First of all, both patches have to depict the same portion of
the object, i.e. look nearly the same after being appropriately warped to compensate
for the different viewpoints. This is implemented by computing the normalized
correlation of both patches. The constraint is satisfied if the correlation result exceeds
an empirically determined threshold. The satisfaction of this constraint implicitly
contains a range-based segmentation of the scene.

Recognition Constraint: In addition, both patches should depict portions of the object
the system is looking for. Here we use a probabilistic recognition system that
currently takes color as the local appearance feature. Our measurements $m_k$ used by
equation (3) are 16 different colors obtained by an unsupervised vector quantization
(using the K-Means algorithm on characteristic images) of the UV-sub-space of the
YUV-color-space. The constraint is satisfied if the likelihood ratio of equation (3) is
bigger than the detection threshold $\lambda = 150$. The recognition part of our algorithm
is thus similar to those presented in [?] and [?].

If both constraints are satisfied, the observation density $p(z|x)$ is computed as follows:

$$p(z|x) = \prod_k p(m_k(x^l)|o) \prod_k p(m_k(x^r)|o)$$

where $m_k(x^{l,r})$ is the local appearance measurement computed at the projection $x^{l,r}$ of
the position hypothesis x into the left and right images. The object position is computed
as the first moment of $p(z|x)$, which in turn is approximated by the weighted sample
set.
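The evaluation of one position hypothesis could look roughly as follows. The sketch takes the two already warped gray-value patches and their quantized color labels, applies the consistency constraint (normalized correlation) and the recognition constraint (likelihood ratio against a threshold), and returns the product of the object likelihoods. Applying the ratio test to the labels of both patches jointly, the correlation threshold `ncc_min`, and all names are assumptions; projection and warping are left out.

```python
import numpy as np

def normalized_correlation(a, b):
    """Zero-mean normalized correlation of two equally sized patches."""
    a = a.astype(float).ravel() - a.mean()
    b = b.astype(float).ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def observation_density(patch_l, patch_r, labels_l, labels_r,
                        p_obj, p_bg, ncc_min=0.8, lam=150.0):
    """p(z | x) for one hypothesis; 0 if either constraint fails."""
    # Consistency constraint: both patches must look alike.
    if normalized_correlation(patch_l, patch_r) < ncc_min:
        return 0.0
    labels = np.concatenate([labels_l.ravel(), labels_r.ravel()])
    log_obj = np.sum(np.log(p_obj[labels]))
    log_bg = np.sum(np.log(p_bg[labels]))
    # Recognition constraint: likelihood ratio must exceed the threshold.
    if log_obj - log_bg <= np.log(lam):
        return 0.0
    # Product of p(m_k | o) over the left and the right patch.
    return float(np.exp(log_obj))

p_obj = np.array([0.9, 0.1])              # two color classes for brevity
p_bg = np.array([0.1, 0.9])
patch = np.arange(9).reshape(3, 3)
obj_labels = np.zeros((3, 3), dtype=int)  # all pixels show the object color
bg_labels = np.ones((3, 3), dtype=int)

d_good = observation_density(patch, 2 * patch + 5, obj_labels, obj_labels, p_obj, p_bg)
d_inconsistent = observation_density(patch, -patch, obj_labels, obj_labels, p_obj, p_bg)
d_background = observation_density(patch, patch, bg_labels, bg_labels, p_obj, p_bg)
```

Because a failed constraint yields a weight of zero, such hypotheses drop out at the next resampling step, which is how the segmentation and recognition results steer the tracker.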



3 Experimental Results 



The combined object recognition, localization and tracking method presented in the
previous sections has been implemented and evaluated using real-world data. The task in
the experiments was to discover and track a known object (bottle of orange juice) under
realistic environmental conditions (complex background, dynamic scene). Object model
($p(m_k|o)$) and background model ($p(m_k|\bar{o})$) were estimated from training images. The
position belief distribution $p(x)$ was initialized to be uniform inside the field of view
up to a distance of 5 m.

Fig. 2 shows that only 14 iterations are required to initially recognize and localize the
object at a distance of 1.5 m. It is obvious that this initial phase could not be represented
using a unimodal Gaussian model of the belief distribution. The approximation used
by the Condensation algorithm is able to represent the belief with only 1000 samples.
After the initial discovery of the object, the samples are concentrated in a very small
portion of the search space and cover only one quarter of the image. The computation
of the feature values has to be performed in this portion of the images only. It has to be
noted that there is no expensive search for stereo correspondence in the disparity space.

The experimental setup selected for this paper ensures that the object recognition
part of the problem can reliably be solved, i.e. the features used are "appropriate" for the
problem. The experiment shows that, if this is the case, the integrated approach presented
in this paper can segment the scene as well as recognize, localize and track the object of
interest.

* This is based on the assumption of a single object being tracked.

The image sequence shown in fig. 2 has a duration of 1.5 s, after which the object
was found and localized. On our Pentium II/400 computer we can compute around 10
iterations per second, and our implementation still has room for significant optimizations.




Fig. 2. The convergence of the sampled belief approximation after 0, 5, 10 and 14 iterations
on a static scene. The images show the sample set (N = 1000) projected onto the left and
right image and the floor plane.

After the convergence of the belief distribution, the object can reliably be tracked
by the system. Fig. 3 depicts the trace of a tracking experiment, where the bottle was
manually moved (approximately) on a straight line. The images show the scene before
the author's hand enters it. The small dots depict the center of mass of the belief
distribution, estimated from the samples at each iteration during the experiment. Even with
our slow cycle of only 10 Hz the tracker locks onto the object robustly. This is due to
the tight coupling with the object recognition, which from the tracking point of view
provides a sharp object-background separation.




Fig. 3. The trace of a short tracking experiment. 



4 Conclusions 

We have presented a novel and very efficient approach for a probabilistic integration
of recognition, 3D-localization and tracking of objects in complex scenes. Experiments
indicate that this approach is viable and gives very satisfactory results. It is important to
note that the individual components of our system are still very simple. The probabilistic
model does not account for spatial dependencies among the image measurements $m_k$
(and thus prohibits the estimation of object rotations), the color features we use will not
be sufficient for more complex objects, and finally the underlying dynamic model of the
Condensation tracker is extremely limited. But these limitations can be overcome
using more advanced and well-known algorithms, especially for feature extraction and
object modeling (see for example [?] for wavelet features incorporating spatial
dependencies). Future work will focus on these improvements and on other new features such as the
incorporation of multiple objects. In addition, the system will be integrated on our mobile
manipulation test-bed MobMan [?].
Abstract. Existing approaches to extracting 3D point landmarks based
on deformable models require a good model initialization to avoid local
suboptima during model fitting. To overcome this drawback, we propose
a generally applicable novel hybrid optimization algorithm combining
the advantages of both conjugate gradient (cg-)optimization (known for
its time efficiency) and genetic algorithms (exhibiting robustness against
local suboptima). We apply our algorithm to 3D MR and CT images
depicting tip-like and saddle-like anatomical structures such as the horns
of the lateral ventricles in the human brain or the zygomatic bones as
part of the skull. Experimental results demonstrate that the robustness
of model fitting is significantly improved using hybrid optimization
compared to a purely local cg-method. Moreover, we compare an edge
strength-based to an edge distance-based fitting measure.

Keywords: 3D landmark extraction, deformable models, robustness, 
hybrid optimization, conjugate gradient method, genetic algorithm 

1 Introduction 

Extracting 3D point landmarks from 3D tomographic images is a prerequisite 
for landmark-based approaches to 3D image registration, which is a fundamental 
problem in computer-assisted neurosurgery. While earlier approaches exploit the 
local characteristics of the image data using differential operators (e.g., |S],ITT|), 
in 1^ an approach based on parametric deformable models has recently been pro- 
posed that takes into account more global image information and allows to lo- 
calize 3D point landmarks more accurately. However, since local optimization is 
employed for model fitting, a good model initialization is required to avoid local 
suboptima. To overcome this drawback, we propose a new, generally applicable 
hybrid optimization algorithm combining the computational efficiency of the (lo- 
cal) conjugate gradient (cg-)optimization method with the robustness of (global) 
genetic algorithms. Existing optimization algorithms for fitting parametric de- 
formable models to image data are either purely local (e.g., 0,|IS],P,|S|) or 
strictly global (e.g., a, PH). Moreover, we compare an edge strength- (e.g., USD 
with an edge distance-based fitting measure (cf., e.g., P|). We apply our fitting 
algorithm in order to extract salient surface loci (curvature extrema) of tip- and 
saddle-like structures such as the tips of the ventricular horns or the saddle 
points at the zygomatic bones (see Fig. 1(a),(b)). For representing such
structures, we use globally deformed quadric surfaces (Sect. 2). The fitting measures 
Fig. 1. (a),(b): Ventricular horns of the human brain (from JEj) and the human skull 
(from |2I). Examples of 3D point landmarks are indicated by dots. (c),(d): Quadric 
surfaces as geometric models for tips ((c): tapered and bended half-ellipsoid) and for 
saddle structures ((d): hyperboloid of one sheet). The landmark positions are indicated 
by a dot. 

for model fitting are then described in Sect. 3, while our hybrid algorithm for 
optimizing a fitting measure w.r.t. the model parameters is outlined in Sect. 4. 
Experimental results of studying the robustness of model fitting are presented 
in Sect. 5. In particular, we analyze the landmark localization accuracy of our 
new approach and compare it with that of purely local cg-optimization. 

2 Modeling Tip- and Saddle-Like Structures with Quadrics 

In the literature, a variety of 3D surface models has been used for different 
applications, e.g., for segmentation, registration, and tracking (e.g., 

see 0 for a survey). To extract 3D point landmarks, we here use 
quadric surfaces as geometric models for tip- and saddle-like structures (0) 
since they well represent the anatomical structures of interest here, but only 
have a small number of model parameters. In addition, it is advantageous here 
that they may be represented by both a parametric and an implicit defining 
function. Tapered and bended ellipsoids are utilized for representing 3D tip-like 
structures such as the ventricular horns, whereas hyperboloids of one sheet are 
used for 3D saddle-like structures such as the zygomatic bones (see Fig. 1(c),(d)). 
For tip-like structures, the parametric form of our model is obtained by applying 
linear tapering P] and quadratic bending ^ as well as a rigid transformation, 
resp., to the parametric form of an ellipsoid: 

\[
x_{def}(\theta,\phi) = R_{\alpha,\beta,\gamma}
\begin{pmatrix}
a_1 \cos\theta \cos\phi \,(\rho_x \sin\theta + 1) + \delta \cos\upsilon \,(a_3 \sin\theta)^2 \\
a_2 \cos\theta \sin\phi \,(\rho_y \sin\theta + 1) + \delta \sin\upsilon \,(a_3 \sin\theta)^2 \\
a_3 \sin\theta
\end{pmatrix} + t \qquad (1)
\]

where $0 \le \theta \le \pi/2$ and $-\pi \le \phi < \pi$ are the latitude and longitude
angle parameters, resp. Further on, $a_1, a_2, a_3 > 0$ are scaling parameters,
$\rho_x, \rho_y > 0$ denote the tapering strengths in x- and y-direction, and
$\delta, \upsilon$ determine the bending strength and direction, resp. For the rigid
transformation, $\alpha, \beta, \gamma$ denote the Eulerian angles of rotation and
$t = (X, Y, Z)^T$ is the translation vector of the origin. Hence, the model is
described by the parameter vector
$p = (a_1, a_2, a_3, \delta, \upsilon, \rho_x, \rho_y, X, Y, Z, \alpha, \beta, \gamma)$.
The landmark position is then given by
$x_l = x_{def}(\pi/2, 0) = R_{\alpha,\beta,\gamma}\,(\delta \cos\upsilon \, a_3^2,\; \delta \sin\upsilon \, a_3^2,\; a_3)^T + t$.
The parametric form of hyperboloids of one sheet is the same as the one given in p|. 
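The parametric model can be illustrated with a short sketch. It evaluates a point on a tapered and bended half-ellipsoid and the landmark position at the tip (theta = pi/2). The function names are hypothetical, and the rotation order Rz(alpha) Ry(beta) Rx(gamma) is an assumption made only for this example, since the text speaks of Eulerian angles without fixing a convention.

```python
import math

def rotation_zyx(alpha, beta, gamma):
    """Rotation matrix Rz(alpha) * Ry(beta) * Rx(gamma).

    ASSUMPTION: this axis order is illustrative; the paper only mentions
    'Eulerian angles' without stating the convention.
    """
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb, cb * sg, cb * cg],
    ]

def surface_point(p, theta, phi):
    """Point on the tapered and bended ellipsoid for the 13-element vector p."""
    a1, a2, a3, delta, upsilon, rho_x, rho_y, X, Y, Z, alpha, beta, gamma = p
    s, c = math.sin(theta), math.cos(theta)
    bend = delta * (a3 * s) ** 2          # quadratic bending term
    local = (
        a1 * c * math.cos(phi) * (rho_x * s + 1.0) + bend * math.cos(upsilon),
        a2 * c * math.sin(phi) * (rho_y * s + 1.0) + bend * math.sin(upsilon),
        a3 * s,
    )
    R = rotation_zyx(alpha, beta, gamma)
    t = (X, Y, Z)
    # rigid transformation: rotate, then translate
    return tuple(sum(R[i][j] * local[j] for j in range(3)) + t[i] for i in range(3))

def landmark(p):
    """Landmark position: the tip of the model at theta = pi/2."""
    return surface_point(p, math.pi / 2.0, 0.0)
```

With zero rotation, the landmark reduces to the bending offset plus the translation, which matches the closed-form expression for the landmark position given in the text.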
3 Model Fitting with Edge-Based Fitting Measures 



In order to fit the geometric models from Sect. 2 to the image data, a fitting 
measure is optimized w.r.t. the model parameters. Here, we consider an edge 
strength- and an edge distance-based fitting measure. For the edge strength-based 
measure $M_{ES}$ (e.g., m), M), the edge strength $e_g$ is integrated over the model 
surface M: 

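\[
M_{ES}(p) = -\int_M e_g(x)\,dF
          = -\iint_{\theta,\phi} e_g\bigl(x(\theta,\phi;p)\bigr)
            \left\| \frac{\partial x}{\partial\theta} \times \frac{\partial x}{\partial\phi} \right\|
            d\theta\, d\phi \;\to\; \mathrm{Min!}, \qquad (2)
\]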



where $e_g(x) = \|\nabla g(x)\|$ is the gradient magnitude of the intensity function g 
and x is a point on the model surface M which is parameterized by $\theta, \phi$. The 
vector of model parameters is denoted by p. To emphasize small surfaces, we 
additionally apply a surface weighting factor to the fitting measure (|2D which 
then takes the form $M_{ES} = -\int_M e_g(x)\,dF \,/\, \sqrt{\int_M dF}$. 



The edge distance-based fitting measure $M_{ED}$ used here is written as (cf., e.g 

\[
M_{ED}(p) = \sum_{i=1}^{N} \rho\!\left( \frac{F(\xi_i,p) - 1}{\|\nabla F(\xi_i,p)\|} \right) \;\to\; \mathrm{Min!} \qquad (3)
\]

The sum is taken over all N image voxels $\Xi = \{\xi_1, \ldots, \xi_N\}$ which - in
order to eliminate the influence of neighbouring structures - lie within a
region-of-interest (ROI) and whose edge strength $e_g(\xi_i)$ exceeds a certain
threshold value. Further on, we use $\rho(x) = |x|$ for all $x \in \mathbb{R}$ as a
distance weighting function to reduce the effect of outliers (CHI). The argument
of $\rho$ is a first order distance approximation between the image voxel with
coordinates $\xi_i$ and the model surface (ini), where F denotes the inside-outside
function of the tapered and bended quadric surface after applying a rigid
transform. The inside-outside function F of an undeformed ellipsoid can be
written as 

\[
F(\xi) := |x/a_1|^2 + |y/a_2|^2 + |z/a_3|^2, \qquad (4)
\]

where $\xi = (x, y, z)^T$. Since ellipsoids have a closed surface, there is a simple 
interpretation of the inside-outside function that explains its name: 

\[
\begin{cases}
\text{if } F(\xi) = 1, & \xi \text{ is on the surface,} \\
\text{if } F(\xi) > 1, & \xi \text{ is outside, and} \\
\text{if } F(\xi) < 1, & \xi \text{ is inside.}
\end{cases} \qquad (5)
\]

Due to inaccuracies of the first order distance approximation that is used in (3), 
the edge distance-based fitting measure (3) is not suitable for hyperboloids of 
one sheet. 
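As an illustration of the inside-outside function and the first order distance approximation, the following sketch handles the simplest case: an undeformed, axis-aligned ellipsoid centred at the origin. It assumes the common first order form (F - 1) / ||grad F|| for the distance and uses rho(x) = |x| as weighting. All function names are hypothetical, and tapering, bending and the rigid transform are deliberately left out.

```python
import math

def inside_outside(xi, a):
    """F(xi) of an undeformed ellipsoid with semi-axes a = (a1, a2, a3)."""
    return sum((c / s) ** 2 for c, s in zip(xi, a))

def gradient(xi, a):
    """Analytic gradient of F with respect to the voxel coordinates."""
    return tuple(2.0 * c / (s * s) for c, s in zip(xi, a))

def first_order_distance(xi, a):
    """Signed first order approximation of the distance to the surface F = 1.

    ASSUMPTION: the form (F - 1) / ||grad F|| is used here for illustration.
    """
    g = gradient(xi, a)
    norm = math.sqrt(sum(v * v for v in g))
    if norm == 0.0:
        raise ValueError("gradient vanishes at the ellipsoid centre")
    return (inside_outside(xi, a) - 1.0) / norm

def edge_distance_measure(voxels, a, rho=abs):
    """Sum of weighted distance approximations over the selected edge voxels."""
    return sum(rho(first_order_distance(xi, a)) for xi in voxels)
```

For a sphere of radius 2, a voxel at distance 3 from the centre yields an approximate distance of about 0.83 instead of the true value 1, which shows that the approximation is only accurate close to the surface.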

Also, a volume factor is used in conjunction with 0 to emphasize small vol- 
umes. This factor has been chosen as 1-1 , where the weighting 

factor ai^est 0 . 2 ,est 0 ' 3 ,est IS coarscly estimated to a value of 10^ vox^ {vox denotes 
the spatial unit of an image voxel). For volume weighting factors (or size factors), 
see also, e.g., m- 
4 A Novel Hybrid Optimization Algorithm 

Most optimization algorithms considered in the literature for model fitting are lo- 
cal algorithms such as the conjugate gradient (cg-)method (e.g., |I3iIIH],P)|5|)- 
The cg-method combines problem specific search directions of the method of 
steepest descent with optimality properties of the method of conjugate directions 
(e.g., 0). However, since it is a purely local method, it is prone to run into 
local suboptima. On the other hand, global optimization methods such as ge- 
netic algorithms (GAs; e.g., 0) have been proposed to avoid local suboptima 
(e.g., 0,ini), but are plagued with slow convergence rates. We here propose 
a hybrid algorithm which combines the advantages of both methods. Similar to 
GAs, we consider a whole population of parameter vectors, but we differ in the 
mutation strategy since we do not use bit-flips and crossovers to imitate natural 
mutation (0). By contrast, we use several most promising local optima resulting 
from a line search after each cg-step as candidate solutions. In order to obtain 
a generally applicable strategy, we adapt the population size to the complexity 
of the problem at hand by increasing the maximal population size each time a 
candidate solution converges to a local optimum, i.e. when its objective function 
value does not improve for a given number of cg-iterations. Consequently, several 
parameters can be adapted to the specific optimization problem at hand: 

• the maximum population size that must not be exceeded (here: 20), 

• the number of cg-iterations after which the least successful population mem- 
bers (measured by their value of the fitting measure) are discarded (here: 5), 

• the minimum number of population members that are retained after each 
such ’survival of the fittest’ step (here: 5), 

• the number of cg-iterations with no significant improvement of the value of 
the fitting measure after which a population member is marked convergent 
and is not subject to further cg-iterations (here: 80), and 

• a difference threshold for two parameter vectors of the deformable model 
below which they are considered as being equal. 

The mentioned parameters have been used in all our experiments. Except for the 
need of adjusting these parameters, the optimization strategy presented here is 
a general-purpose method for poorly initialized nonlinear optimization problems 
and its applicability is not confined to model fitting problems in medical image 
analysis. Only one example of hybrid optimization in image analysis is known 
to us: In jSj, discontinuity preserving visual reconstruction problems, e.g. sparse 
data surface reconstruction and image restoration problems, are described as 
coupled (binary-real) optimization problems. An informed GA is applied to the 
binary variables (describing the discontinuities), while a cg-method is applied 
to the real variables for a given configuration of the binary variables visited by 
the GA. By contrast, in our approach the local and the global part are treated 
uniformly. 
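The strategy described in this section can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the sampled line search, the Polak-Ribiere update, the numerical gradient, the step-size handling and the rule that the population limit grows by one per converged member are all assumptions chosen to keep the example short. The parameter names (max_pop, cull_every, min_keep, patience, same_tol) are hypothetical labels for the five parameters listed above.

```python
import math

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad

def line_minima(f, x, d, t_max, n=40):
    """Local minima of f sampled along x + t*d, 0 < t <= t_max, best first."""
    pts = [[xi + t_max * (k + 1) / n * di for xi, di in zip(x, d)] for k in range(n)]
    vals = [f(p) for p in pts]
    found = []
    for k in range(n):
        left = vals[k - 1] if k > 0 else f(x)
        right = vals[k + 1] if k < n - 1 else math.inf
        if vals[k] < left and vals[k] <= right:
            found.append((vals[k], pts[k]))
    return sorted(found, key=lambda c: c[0])

def hybrid_minimize(f, starts, max_pop=20, cull_every=5, min_keep=5,
                    patience=80, same_tol=1e-3, max_iter=300):
    def member(x):
        g = num_grad(f, x)
        return {"x": list(x), "f": f(x), "g": g, "d": [-v for v in g],
                "t": 1.0, "stall": 0, "done": False}

    pop = [member(x) for x in starts]
    cap = min_keep                      # current population limit (assumed rule)
    for it in range(1, max_iter + 1):
        active = [m for m in pop if not m["done"]]
        if not active:
            break
        for m in active:
            cands = line_minima(f, m["x"], m["d"], m["t"])
            if cands and cands[0][0] < m["f"] - 1e-12:
                # Further minima on the same search line become new candidates.
                for _, x in cands[1:]:
                    known = any(math.dist(x, o["x"]) < same_tol for o in pop)
                    if len(pop) < max_pop and not known:
                        pop.append(member(x))
                fv, x = cands[0]
                g_new = num_grad(f, x)
                gg = sum(v * v for v in m["g"])
                pr = sum(a * (a - b) for a, b in zip(g_new, m["g"]))
                beta = max(0.0, pr / gg) if gg > 0.0 else 0.0
                m["d"] = [-a + beta * b for a, b in zip(g_new, m["d"])]
                m.update(x=x, f=fv, g=g_new, stall=0, t=min(1.0, 2.0 * m["t"]))
            else:
                # No improvement: shrink the step, restart along steepest descent.
                m["stall"] += 1
                m["t"] *= 0.5
                m["d"] = [-v for v in m["g"]]
                if m["stall"] >= patience:
                    m["done"] = True            # marked convergent
                    cap = min(max_pop, cap + 1)
        if it % cull_every == 0:
            pop.sort(key=lambda m: m["f"])
            del pop[cap:]                       # 'survival of the fittest'
    return min(pop, key=lambda m: m["f"])
```

On a test function with two basins, such as f(x, y) = (x^2 - 1)^2 + 0.3x + y^2, keeping several minima of each line search lets the population reach the deeper basin even from a start point on the other side.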

5 Experimental Results for 3D Tomographic Images 

Scope of experiments. In our experiments, the deformable models were fit- 
ted to tip-like and saddle-like anatomical structures and our hybrid optimization 
algorithm has been compared to purely local cg-optimization w.r.t. poorly
initialized model parameters using 

• different types of image data: two 3D T1-weighted MR images and one 
3D CT image of the human head, 

• different types of landmarks: frontal/occipital horn of the left/right lateral 
ventricle, left/right zygomatic bone as part of the skull, 

• different fitting measures: edge distance-based and edge strength-based, and 

• different sizes of the region of interest (ROI) for model fitting: 

ROI radius of 10 vox and 15 vox (vox: spatial unit of an image voxel). 

Experimental strategy. For each 3D MR and 3D CT image, an initial good 
fit is determined by repeated model fittings and visual inspection of the fitting 
results. For obtaining poor initial estimates for model fitting, the parameter 
values of the initial good fit are varied by adding Gaussian distributed random 
numbers with zero mean and large variances. In order to determine the landmark 
localization error e, the landmark positions calculated from the fitted deformable 
models are compared to ground truth positions that were manually specified in 
the 3D images in agreement with up to four persons. In addition, to measure the 
model fitting accuracy, we consider the root-mean-squared distance between edge 
points of the image and the model surface, $e_{RMS}$, using a Euclidean distance 
map (dDI) from the image data after applying a 3D edge detection algorithm 
based on |2I. This procedure is iterated sufficiently often (here: 100 times) with 
different, randomized model initializations. The mean values and RMS estimates 
of the resulting values of $e$ and $e_{RMS}$ are then tabulated. For evaluating the 
fitting measures (2),(3), the derivatives of the intensity function are computed 
numerically using cubic B-spline interpolation and Gaussian smoothing (see pj). 

General results. Common to all experiments is that the final value of the 
fitting measure is better by about 10-50% for hybrid optimization than for purely 
local cg-optimization. In most cases, the landmark localization and the model 
fitting accuracy also improve significantly. Thus, hybrid optimization turns out 
to be superior to purely local cg-optimization at the expense of an increase in 
computational costs of a factor of 5-10 (30s-90s for local cg-optimization and 
150s-900s for hybrid optimization on a SUN SPARC Ultra 2 with 300MHz CPU). 

The edge distance-based fitting measure in (3) turns out to be more suitable 
for 3D MR images of the ventricular horns with high signal-to-noise ratio since 
it incorporates distance approximations between the image data and the model 
surface (long range forces, cf. |]3). However, in comparison to (2), it is relatively 
sensitive to noise. Moreover, it is not suitable for hyperboloids of one sheet due 
to inaccuracies of the first order distance approximation associated with it. 

Results for the ventricular horns. The tips of the frontal and occipital horns 
of the lateral ventricles are considered here. Typical examples of successful model 
fitting, which demonstrate the robustness of model fitting, are given in Fig. 2. 
Here, contours of the model initialization are drawn in black and the results 
of model fitting using purely local cg-optimization are drawn in grey, while the 
results of model fitting using our new hybrid optimization algorithm are drawn in 
white. The ground truth landmark positions are indicated by a ©-symbol. As can 
be seen from the averaged quantitative results in Table 1, hybrid optimization 
is superior to purely local cg-optimization and yields in most cases better model 
fitting ($e_{RMS}$) and landmark localization ($e$) results (cf. also Figs. 2(a),(b)). 
Note that rather coarsely initialized model parameters have been used
($e_{initial} \approx 7 \ldots 9$ vox) and thus some unsuccessful fitting results -
particularly in the case of the less pronounced occipital horns - deteriorate the
average accuracy of model fitting as shown in Table 1. 



Table 1. Fitting results averaged over 100 model fittings with randomized poor model 
initializations for 3D MR images of the frontal/occipital ventricular horns using 13 
model parameters (e: mean landmark localization error (in vox), e_RMS: RMS distance 
between deformable model and image data (in vox), voxel size = 0.86 x 0.86 x 1.2 mm³). 

                                Model            Edge dist.-b. fitt. meas.      Edge strength-b. fitt. meas.
                                initialization   local cg-opt.  hybrid opt.     local cg-opt.  hybrid opt.

Frontal horn (left)     e       7.71 ± 3.16      3.28 ± 2.99    1.40 ± 1.18     3.54 ± 2.18    2.49 ± 2.21
                        e_RMS   2.22 ± 1.10      1.00 ± 0.63    0.65 ± 0.22     1.04 ± 0.31    0.87 ± 0.35

Frontal horn (right)    e       6.57 ± 3.18      3.87 ± 2.16    3.15 ± 2.18     6.55 ± 3.53    5.19 ± 3.70
                        e_RMS   2.12 ± 1.11      1.05 ± 0.60    0.78 ± 0.25     1.56 ± 1.26    1.28 ± 0.79

Occipital horn (right)  e       9.08 ± 4.42      6.90 ± 3.89    6.68 ± 3.93     4.74 ± 4.33    4.61 ± 4.31
                        e_RMS   3.00 ± 1.40      2.06 ± 0.93    2.04 ± 0.87     1.34 ± 0.87    1.29 ± 0.78




Results for the zygomatic bones. All results for the zygomatic bones were 
obtained with the edge strength-based fitting measure (2). Model fitting for 
the saddle points at the zygomatic bones (e.g., Fig. 2(c)) in general is not as 
successful as it is for the tips of the ventricular horns since our geometric
primitive does not describe the anatomical structure at hand with comparable
accuracy. However, the mean landmark localization error $e$ can be reduced from 
initially $e_{initial} = 6.4 \ldots 6.9$ vox to $e = 2.5 \ldots 3.2$ vox and the accuracy
of model fitting is $e_{RMS} = 1.5 \ldots 1.8$ vox (voxel size = 1.0 mm³). 

6 Conclusion 

In this paper, landmark extraction based on deformable models has been inves- 
tigated in order to improve the stability of model fitting as well as of landmark 
localization in the case of poorly initialized model parameters. To this end, a gen- 
erally applicable novel hybrid optimization algorithm has been introduced and 
edge strength- and edge distance-based fitting measures have been compared. Ex- 
perimental results have demonstrated the applicability of our hybrid algorithm 
as well as its increased robustness as compared to a purely local cg-method. 
However, the experimental results do not clearly favour one fitting measure. For 
the frontal ventricular horns, our edge distance-based fitting measure is more 
successful, while for the less pronounced occipital horns and for the zygomatic 
bones, the edge strength-based fitting measure is more suitable. 
(b) 3D MR image of the occipital horn of the right lateral ventricle, edge strength-
based fitting measure, ROI size 15.0 vox 

(c) 3D CT image of the left zygomatic bone, edge strength-based fitting measure, 
ROI size 15.0 vox 

Fig. 2. Examples of successfully fitting tapered and bended half-ellipsoids to 
3D MR images of the frontal and occipital horns of the lateral ventricles (Fig. 2(a-b)) 
as well as of fitting a half-hyperboloid with no further deformations to a 3D CT image 
of the zygomatic bone (Fig. 2(c)). Contours of the model surfaces in axial, sagittal, 
and coronal planes are depicted here (from left to right). Black: model initialization, 
grey: fitting result for local cg-optimization, and white: fitting result for our hybrid
optimization algorithm. The ground truth landmark positions are indicated by a ©-sign. 
Abstract. In many branches of industry, piled box-like objects have to 
be recognized, grasped and transferred. Unfortunately, existing systems 
only deal effectively with the simplest configurations (i.e. neatly placed 
boxes). It is known that the detection of 3D-vertices is a crucial step 
towards the solution of the problem, since they reveal essential informa- 
tion about the location of the boxes in space. In this paper we present 
a technique based on edge detection and robust line fitting, which ef- 
ficiently detects 3D-vertices. Combining this technique with the advan- 
tages of a time of flight laser sensor for data acquisition, we obtain a 
fast system which can operate in adverse environments independently of 
lighting conditions. 



1 Introduction 

This paper addresses the depalletizing problem (or bin picking problem) in the 
context of which a number of objects of arbitrary dimensions, texture and type 
must be automatically located, grasped and transferred from a pallet (a rectan- 
gular platform), on which they reside, to a specific point defined by the user. The 
need for a robust and generic automated depalletizing system stems primarily 
from the car and food industries. An automated system for depalletizing is of 
great importance because it undertakes a task that is very monotonous, stren- 
uous and sometimes quite dangerous for humans. More specifically, we address 
the construction of a depalletizer dealing with cluttered configurations of boxes 
as shown in Fig. 1. 

Existing systems can be classified as follows: systems incorporating no vision 
at all and systems incorporating vision. The latter group can be further divided 
into systems based on intensity or range data. For an in depth discussion of 
existing systems the reader is referred to 0, where the superiority of systems 
employing range imagery is discussed as well. 

One of the fastest and conceptually closest systems to the one we propose, 
is the one of Chen and Kak 0. A structured light range sensor is used for data 
acquisition. A region based technique is used to segment the range image into 
surfaces. Since a completely visible vertex in the scene provides the strongest 
Fig. 1. Intensity image 



constraints for calculating hypotheses about the position of an object in the 
scene, vertices are detected by intersecting surfaces acquired from the segmenta- 
tion. The main disadvantage of this approach is that since a vertex is computed 
as the intersection of three surfaces, the objects should expose three surfaces to 
the imaging source. In many cases this is not true, which results in a small num- 
ber of localized objects per scan. In the extreme case of neatly placed objects 
of the same height, for example, no vertices can be detected at all. A second 
disadvantage originates from the fact that, according to jS|, the region based 
segmentation methods of range data are time consuming operations. A third 
drawback, common to almost all the systems in the literature, is the usage of 
structured light techniques for range data acquisition. Since it is desirable for the 
projected light to be the only illumination source for the scene, they certainly 
perform better than camera based systems, but they still cannot be used in
uncontrolled industrial environments. Furthermore, they require time-consuming 
calibration. 

This paper, in which a sub part of a depalletizing system is described, is 
organized as follows. 

Initially, we describe the scene data acquisition module, which is capable of 
operating in a variety of environmental conditions, even in complete darkness. 
The description of a fast and accurate edge detector for object border extraction 
based on scan line approximation follows. The presentation of an algorithm for 
detecting vertices in a fast and robust manner is then given. This technique 
requires only two lines for vertex detection (as opposed to three surfaces) and 
therefore provides richer information about the objects and, as will be seen, 
at a low computational cost. Finally, a paragraph summarizing the system’s 
advantages and outlining future work concludes. 



2 Data Acquisition 

The system comprises an industrial robot, namely the model KR 15/2 manu- 
factured by KUKA GmbH, a square vacuum-gripper and a time of flight laser 
sensor (model LMS200, manufactured by SICK GmbH). The output of the laser 
sensor is a set of two-dimensional points, which are defined as the intersection 
of the objects with the sensor’s scanning plane. This set of planar points 
acquired from the laser sensor will be hereinafter referred to as a scan line. The 
sensor is integrated on the gripper, and the latter is seamlessly attached to the 
robot’s flange, as in [7]. In this way, we take full advantage of the flexibility for 
viewpoint selection made possible by a six degrees of freedom robot. 





(a) Range image 



(b) Edge map 



Fig. 2. Jumbled boxes 



Due to the fact that we plan the robot to grasp objects in a top-to-bottom 
approach, we perform a linear scanning of the upper part of the pallet. The 
robot is programmed to execute a linear movement, the end points of which are 
the mid points of the two opposite sides of the rectangular pallet. The scanning 
plane of the sensor is always perpendicular to the direction of the movement 
and the sensor faces the upper side of the pallet. The set of scan-lines collected 
during the movement of the sensor is our 2.5D image. 

One of the most salient problems of range imagery based on time of flight 
laser sensors, is the occurrence of noisy points caused by range or reflectance 
changes p. In order to discard these noisy points, before further processing, we 
apply the noise attenuation method described in fP to the scan-lines of our range 
image. An acquired range image with attenuated noise is depicted in Fig. 2(a). 

3 Edge Detection 

The adopted scan line approximation algorithm, proposed in [^, splits a scan 
line into segments which are later approximated by model functions. With the 
help of these functions, edge strength values are computed for each pair of end
points of neighboring segments. To illustrate, the edge strength values for two
segments approximated by the functions $f_1(x)$ and $f_2(x)$, which are linear in
our case (Fig. 3(a)) but could be curves in general, are defined as 

\[ \mathrm{JumpEdgeStrength} = |f_1(x) - f_2(x)|, \]

\[ \mathrm{CreaseEdgeStrength} = \cos^{-1} \frac{(-f_1'(x), 1)\,(-f_2'(x), 1)^T}{\|(-f_1'(x), 1)\|\;\|(-f_2'(x), 1)\|}, \]
where x = . The scan-line splitting technique 0 is illustrated in 
Fig. 3(b). If we suppose that our scan line comprises the points with labels 
A-E, a linear segment is initially estimated from the end points and the maximum
distance of the line to every point of the scan line is calculated. If no point 
has a distance greater than a predetermined threshold $T_{split}$ from the
approximation curve, then the process stops for the particular scan line segment,
since it is satisfactorily approximated. If the maximum distance is bigger than
$T_{split}$, the whole process is repeated recursively (e.g. for the scan-lines AC
and CE). The approximating functions are defined on the segments originating from
the splitting by means of a least square fitting. 
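The recursive splitting step can be sketched as follows. This is an illustrative version only: the function names are hypothetical, the scan line is assumed to be a list of 2D points, and splitting at the point of maximum distance is an assumption suggested by the AC/CE example in the text.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance of the 2D point p from the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length

def split_scan_line(points, t_split, first=0, last=None):
    """Return index pairs (first, last) of the segments after recursive splitting.

    A segment is accepted when no point lies farther than t_split from the
    line through its end points; otherwise it is split at the farthest point.
    """
    if last is None:
        last = len(points) - 1
    worst, worst_dist = None, t_split
    for k in range(first + 1, last):
        dist = point_line_distance(points[k], points[first], points[last])
        if dist > worst_dist:
            worst, worst_dist = k, dist
    if worst is None:
        return [(first, last)]
    return (split_scan_line(points, t_split, first, worst)
            + split_scan_line(points, t_split, worst, last))
```

For five points A-E forming a roof shape with its apex at C, a threshold below the apex height yields exactly the two segments AC and CE.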




Fig. 3. Edge detection 



The splitting algorithm is controlled by only one parameter, $T_{split}$. Scan-line
over-segmentation problems are solved when this threshold is increased. 
However, an arbitrary increase of $T_{split}$ produces an under-segmentation
phenomenon. In order not to lose edge information, we set the threshold to a low
value and applied the fast segment-merging technique suggested in 0]j . This is 
where our detector differs from 0. To each of the fitted segments, a significance 
measure (SM) is assigned as 

\[ SM = \frac{L}{e}, \]

where L is the length of the segment and e is the approximation error of the 
least square fitting. The significance measure is based on a pseudo-psychological 
measure of perceptual significance, since it favors long primitives, provided they 
fit reasonably well. According to the merging procedure each segment is com- 
bined sequentially with the previous, the next and both the previous and next 
segments and for each combination a SM is computed. If one of the combinations 
results in a bigger SM than the one of the candidate segment, the corresponding 
segments are merged. 
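A possible reading of the merging procedure is sketched below. Several details are assumptions made for the example: L is taken as the distance between the segment end points, e as the RMS residual of a least-squares line y = m*x + c (so vertical segments are not handled), and a small constant eps avoids division by zero for perfectly fitting segments. The function names are hypothetical.

```python
import math

def fit_error(points):
    """RMS residual of a least-squares line y = m*x + c through the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    m = sxy / sxx
    c = my - m * mx
    return math.sqrt(sum((y - (m * x + c)) ** 2 for x, y in points) / n)

def significance(points, eps=1e-6):
    """SM = L / e, with eps added to e (assumption) to avoid division by zero."""
    return math.dist(points[0], points[-1]) / (fit_error(points) + eps)

def merge_segments(points, segments):
    """Merge neighbouring segments whenever a combination has a larger SM.

    segments: index pairs (first, last) that share end points, as produced
    by a recursive splitting step.
    """
    segs = list(segments)
    i = 0
    while i < len(segs):
        lo, hi = max(i - 1, 0), min(i + 1, len(segs) - 1)
        # combinations: with previous, with next, with both neighbours
        options = sorted({(lo, i), (i, hi), (lo, hi)} - {(i, i)})
        best, best_sm = None, significance(points[segs[i][0]:segs[i][1] + 1])
        for a, b in options:
            sm = significance(points[segs[a][0]:segs[b][1] + 1])
            if sm > best_sm:
                best, best_sm = (a, b), sm
        if best is None:
            i += 1
        else:
            a, b = best
            segs[a:b + 1] = [(segs[a][0], segs[b][1])]
            i = a                      # re-examine the merged segment
    return segs
```

Two collinear pieces of an over-segmented line are merged because the combined segment is longer at the same fitting error, whereas two segments meeting at a corner stay separate because the fitting error of the combination grows much faster than its length.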

An initial version of this algorithm applied on scan-lines is presented in 0, 
where a discussion concerning the algorithm’s accuracy is also presented. The 
application of the algorithm to the range data shown in Fig. 2(a) is depicted 
in Fig. 2(b). The edge map was created by applying the detector to every row 
and column of the range image and by retaining the edge points with high jump 
edge strength values and crease edge strength values around 90 degrees. The time 
needed for the creation of the edge map in the input range image comprising 424 
rows and 183 columns was about 2.8 seconds on an Intel Pentium III, 600MHz 
processor. 
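For linear approximating functions, the two edge strength values and the retention rule can be sketched as below. The function names and both thresholds (jump_min, crease_tol) are hypothetical, and the abscissa x at which the jump is evaluated is passed in by the caller.

```python
import math

def jump_edge_strength(f1, f2, x):
    """|f1(x) - f2(x)| for the approximating functions of two neighbouring segments."""
    return abs(f1(x) - f2(x))

def crease_edge_strength(slope1, slope2):
    """Angle in radians between the normals (-f1', 1) and (-f2', 1)."""
    n1, n2 = (-slope1, 1.0), (-slope2, 1.0)
    dot = n1[0] * n2[0] + n1[1] * n2[1]
    cos_angle = dot / (math.hypot(*n1) * math.hypot(*n2))
    return math.acos(max(-1.0, min(1.0, cos_angle)))   # clamp rounding errors

def keep_edge_point(jump, crease, jump_min, crease_tol=math.radians(20.0)):
    """Retain points with a high jump value or a crease angle near 90 degrees.

    ASSUMPTION: both thresholds are illustrative and not taken from the paper.
    """
    return jump >= jump_min or abs(crease - math.pi / 2.0) <= crease_tol
```

Two segments with slopes 1 and -1 meet at a right angle and therefore give a crease edge strength of pi/2, the value expected at the edge between two perpendicular box faces.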

4 3D Vertex Detection 

As discussed in the introduction, the use of lines as low-level features provides 
us with rich information about the scene at a low computational cost. In the 
following section we shall demonstrate how line segments can be fitted robustly 
to the edge data. We then point out how a-priori knowledge about the objects 
can be used to obtain vertices from the line segments. 



4.1 Line Detection in 3D 

For the line fitting we have chosen the Hough transform (HT) for its robustness 
with respect to noisy, extraneous or incomplete data. Unfortunately, the compu- 
tational complexity and memory requirements of the standard Hough transform 
(SHT) rise exponentially with the number of parameters under detection. There- 
fore, the SHT is unsuitable for 3D-line detection involving four parameters. 

An attempt to reduce the number of parameters by applying the SHT to 
the projection of edge data to the planes defined by the coordinate axes proved 
unsuccessful due to the high line density. 

The probabilistic Hough transforms (PHTs) (cf. H) offer another approach 
to reduce the computational complexity by calculating and accumulating feature 
parameters from sampled points. While this reduces the computational complex- 
ity of the accumulation to some extent, the complexity of the peak detection and 
the memory requirements remain unchanged. 

Leavers developed the dynamic generalized Hough transform (DGHT) 
(cf. 0 ), a technique which allows for the use of one-dimensional accumulators if 
a coarse segmentation of the objects can be obtained and if the features under 
detection can be suitably parameterized. This technique, which belongs to the 
group of PHTs, reduces the memory requirements from $R^n$ to $nR$, where R is 
the resolution of the parameter space and n the number of parameters. Inspired 
by the DGHT, we have developed a technique for 3D-line fitting, which will be 
described in the remainder of this section. 

Initially, the connected components (CCs) of the projection of the edges onto 
the ground plane are detected and sorted according to their size. Line fitting is 
performed by iterating three steps: coarse segmentation of the edge points using 
the biggest CC as seed, robust line parameter estimation, and removal of edge 
points in the vicinity of the robustly estimated line. In the coarse segmentation 
step a 3D-line is fitted in the least square sense to the points of the CC. Points 
within a certain distance of this line are used as input to robust parameter 
estimation. In the robust parameter estimation step the parameters of the line, 
consisting of the direction vector d and the starting point p, are calculated. Pairs 
of points are sampled, their difference vectors normalized and the components 
of the normalized vectors accumulated in three one-dimensional accumulators. 
The maxima of the three accumulators yield the components of the direction 
vector d. The data points are then projected onto the plane through the origin 
whose normal vector is d. The 2D coordinates of the data points in the plane 
are then accumulated in two one-dimensional accumulators and provide us with 
the point p. 
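The direction estimation with three one-dimensional accumulators might look as follows. The sketch makes several assumptions that the text does not specify: the number of sampled pairs, the number of bins, taking the centre of the fullest bin as the peak, and resolving the sign ambiguity of a difference vector by making its largest component positive. The function names are hypothetical.

```python
import math
import random

def accumulator_peak(values, lo, hi, bins):
    """Centre of the fullest bin of a one-dimensional accumulator over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        k = min(bins - 1, max(0, int((v - lo) / width)))
        counts[k] += 1
    best = max(range(bins), key=counts.__getitem__)
    return lo + (best + 0.5) * width

def robust_direction(points, n_pairs=2000, bins=41, seed=0):
    """Estimate the unit direction of the dominant 3D line among the points."""
    rng = random.Random(seed)
    comps = ([], [], [])                     # one accumulator input per component
    for _ in range(n_pairs):
        p, q = rng.sample(points, 2)
        diff = [b - a for a, b in zip(p, q)]
        norm = math.sqrt(sum(c * c for c in diff))
        if norm == 0.0:
            continue
        # ASSUMPTION: orient each difference so its largest component is positive
        sign = 1.0 if max(diff, key=abs) > 0 else -1.0
        for acc, c in zip(comps, diff):
            acc.append(sign * c / norm)
    d = [accumulator_peak(acc, -1.0, 1.0, bins) for acc in comps]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]
```

Because every pair of points on the same line votes for the same three bins while pairs involving outliers scatter their votes, the accumulator maxima follow the line even when a sizeable fraction of the points are outliers. The starting point p could be estimated in the same spirit from two further accumulators after projecting the points onto the plane normal to d.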




Fig. 4. 3D-line fitting 
Figure 4 shows the results of our robust fitting technique for two segments: 
each CC is represented with crosses, the edge points with dots, the result of the 
least square fitting with a thin line and the final result with a thick line. 

Due to the sparseness of the edge data in 3D-space, the precision of the 
lines fitted to the CC is largely sufficient for the segmentation of the edge data. 
The robustness with respect to outliers and the precision required for vertex 
detection is guaranteed by the parameter estimation with the one-dimensional 
accumulators and is clearly illustrated in Fig. 4. 

4.2 Vertex Reconstruction 

3D-vertices can now be obtained from close lines forming near 90 degree angles. 
The vertices can then be used in a hypothesis generation and verification step 
(cf. the system described in 0) to accurately localize the boxes. 
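One way to obtain a vertex candidate from two fitted lines is sketched below. It treats the lines as infinite, takes the midpoint of their shortest connection as the vertex, and accepts it only if the lines are close and nearly perpendicular. The thresholds max_gap and angle_tol_deg as well as the function names are hypothetical and not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vertex_from_lines(p1, d1, p2, d2, max_gap=1.0, angle_tol_deg=10.0):
    """Vertex candidate from two 3D lines p + s*d with unit direction vectors.

    Returns the midpoint of the shortest connection between the lines, or
    None if they are not nearly perpendicular or too far apart.
    """
    c = dot(d1, d2)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, abs(c)))))
    if abs(angle - 90.0) > angle_tol_deg:
        return None
    w = [a - b for a, b in zip(p1, p2)]
    denom = 1.0 - c * c                       # > 0 because the lines are not parallel
    s = (c * dot(d2, w) - dot(d1, w)) / denom
    t = (dot(d2, w) - c * dot(d1, w)) / denom
    q1 = [p + s * d for p, d in zip(p1, d1)]  # closest point on line 1
    q2 = [p + t * d for p, d in zip(p2, d2)]  # closest point on line 2
    if math.dist(q1, q2) > max_gap:
        return None
    return [(a + b) / 2.0 for a, b in zip(q1, q2)]
```

In a complete system the candidate would additionally have to lie near the end points of the underlying edge segments, which this sketch does not check.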

5 Conclusion 

We have presented an efficient technique for the detection of 3D-vertices in range 
images. The joint use of edge detection and a technique inspired by the dynamic 
generalized Hough transform renders this technique fast and robust. This tech- 
nique will be part of a real-time depalletizer for cardboard boxes. Due to the 
fact that vertices are obtained as the intersection of two lines (as opposed to 
the intersection of three planes), a rich set of hypotheses about the location of 
the boxes is generated. This increases the probability of grasping several ob- 
jects per scan. Moreover, if the dimensions of the boxes are known it may be 
possible to achieve complete scene understanding. Another key feature of the 
depalletizer is the use of a laser sensor, which allows the system to be used in 
adverse environments and independent of the lighting conditions. 

In the future, we plan to complete the construction of the system by adding 
a box location hypothesis generation and verification step. 
Abstract. In this paper, we present a new approach to scale-space 
which is derived from the 3D Laplace equation instead of the heat 
equation. The resulting lowpass and bandpass filters are discussed 
and they are related to the monogenic signal. As an application, we 
present a scale adaptive filtering which is used for denoising images. The 
adaptivity is based on the local energy of spherical quadrature filters 
and can also be used for sparse representation of images. 

Keywords: scale-space, quadrature filters, adaptive filters, denoising 



1 Introduction 

In this paper, we present a new approach to scale-space. In the classical case, 
the heat equation leads to a homogeneous linear scale-space which is based on 
convolutions with Gaussian lowpass filters [?]. This method has been extended 
in several directions in order to obtain more capable methods for low-level 
vision. Perona and Malik introduce a diffusion constant which varies in the spatial 
domain, controlling the degree of smoothing [?]. In his approach, Weickert 
substitutes the scalar product in the controlling term by the outer product, yielding a 
method for anisotropic diffusion [?]. Sochen et al. chose a more general point 
of view by considering the image intensity as a manifold [?] where the metric 
tensor controls the diffusion. This list is far from complete, but what is 
important in this context is that all mentioned approaches have three things in 
common. First, they are all based on the heat equation, which can be seen as 
a heuristic choice from a physical model. In [?], however, several approaches for 
deriving the Gaussian scale-space as the unique solution of some basic axioms are 
presented. From our point of view, there is at least one PDE besides the diffusion 
equation that also generates a linear scale-space¹. Second, all approaches try to 
control the diffusion by some information obtained from the image, which is 
mostly related to partial derivatives and a measure for edges. Therefore, 
structures with even and odd symmetries are not weighted in the same way. Thirdly, 
all mentioned diffusions have theoretically infinite duration, which means that 

* This work has been supported by German National Merit Foundation and by DFG 
Graduiertenkolleg No. 357 (M. Felsberg) and by DFG Grant So-320-2-2 (G. Sommer). 
¹ At least our new approach seems to fulfil all axioms of Iijima [?]. If it really does, 
this would imply that there must be an error in the proof of Iijima. 

the diffusion has to be stopped at a specific time. The stopping of the diffusion 
process is crucial for the result. 

Our new method differs from the classical ones with respect to all three 
aspects. First, it is based on the 3D Laplace equation, which yields, besides a 
smoothing kernel, the monogenic signal, a 2D generalisation of the analytic signal 
[?]. Extending the properties of the analytic signal to 2D, the monogenic signal is 
well suited for estimating the local energy of a structure regardless of whether it has 
an even or odd symmetry. Hence, if the smoothing is controlled by the energy, 
it is independent of the local symmetry. The energy of a local structure differs 
with the considered scale and in general it has several local maxima. These 
local maxima correspond to the partial local structures which form the whole 
considered structure. Our approach does not control the smoothing process itself 
but builds up a linear scale-space and applies an appropriate post-processing. 
Depending on the application, a coarsest scale and a minimal local energy are 
chosen, which determine an appropriate 2D surface in the 3D scale-space. 

2 Theory 

In this paper, images are considered as 2D real signals $f(x, y)$, whereas the 
scale-space is a 3D function defined over the spatial coordinates x and y and 
the scale parameter s. The commonly used signal model of scale-space is a stack 
of images, or from the physical point of view, a heat distribution at a specific 
time. Our model is quite different and it is motivated by two points of view. 
On the one hand, 1D signal processing and especially the Hilbert transform are 
closely related to complex analysis and holomorphic functions. On the other 
hand, holomorphic functions can be identified with harmonic vector fields, i.e., 
zero-divergence gradient fields. Indeed, the Hilbert transform emerges from 
postulating such a field in the positive half-plane y > 0 and by considering the 
relation between the components of the field on the real line [?]. Motivated by 
this fact, we introduce the following signal model (see Fig. 1). In the 3D space 
(x, y, s), the signal is embedded as the s-component of a 3D zero-divergence 
gradient field in the plane s = 0. For s > 0 the s-component is given by smoothed 
versions of the original image, so that the s-component corresponds to the 
classical scale-space. For s = 0 the other two components turn out to be the 2D 
equivalent of the Hilbert transform [?] and for s > 0 they are the generalised 
Hilbert transforms of the smoothed image. Hence, the most fundamental 
function in this embedding is the underlying harmonic potential p, i.e., p fulfils the 
Laplace equation (see below). Due to the embedding, the Fourier transform is 
always performed with respect to x and y only, with the corresponding frequencies 
u and v. Accordingly, convolutions are also performed with respect to x and y only. 

As already mentioned in the introduction and according to the signal model, 
we derive the non-Gaussian linear scale-space from the 3D Laplace equation 
($\Delta_3 = \partial_x^2 + \partial_y^2 + \partial_s^2$ denoting the 3D Laplace operator) 

$$\Delta_3\, p(x, y, s) = 0 . \qquad (1)$$

Fig. 1. Signal model for 2D signals, the scale-space, and the generalised analytic signal. 
The signal $f$ is given by the s-derivative of the harmonic potential $p$. The other two 
derivatives ($f_{Rx} = h_x * f$ and $f_{Ry} = h_y * f$) are the generalised Hilbert transforms of $f$ 



The fundamental solution is given by $p(x, y, s) = c\,(x^2 + y^2 + s^2)^{-1/2}$ (the 
Newton potential) [1], where c is a constant that we choose to be $-(2\pi)^{-1}$. Since 
derivative operators commute, the partial derivatives of $p(x, y, s)$ are also 
solutions of (1) (comparable to the fact that the real part and the imaginary part 
of a holomorphic function are harmonic functions). Therefore, we obtain the 
(conjugated) Poisson kernels as further solutions [?]: 

$$g(x, y, s) = \frac{\partial}{\partial s}\,\frac{-1}{2\pi (x^2 + y^2 + s^2)^{1/2}} = \frac{s}{2\pi (x^2 + y^2 + s^2)^{3/2}} \qquad (2)$$

$$h_x(x, y, s) = \frac{\partial}{\partial x}\,\frac{-1}{2\pi (x^2 + y^2 + s^2)^{1/2}} = \frac{x}{2\pi (x^2 + y^2 + s^2)^{3/2}} \qquad (3)$$

$$h_y(x, y, s) = \frac{\partial}{\partial y}\,\frac{-1}{2\pi (x^2 + y^2 + s^2)^{1/2}} = \frac{y}{2\pi (x^2 + y^2 + s^2)^{3/2}} \qquad (4)$$



By taking the limit $s \to 0$, $g$ is the delta-impulse $\delta_0(x)\delta_0(y)$ (representing 
the identity operator) and $(h_x, h_y)^T$ are the kernels of the Riesz transforms, 
so that $(g, h_x, h_y)^T * f$ is the monogenic signal of the real signal $f$ [?]. Even 
more interesting in this context is the fact that for $s > 0$, $(g, h_x, h_y)^T * f$ is the 
monogenic signal of the lowpass filtered signal $g * f$. This follows immediately 
from the transfer functions corresponding to (2)-(4) ($q = \sqrt{u^2 + v^2}$): 



$$G(u, v, s) = \exp(-2\pi q s) \qquad (5)$$

$$H_x(u, v, s) = i\,\frac{u}{q}\,\exp(-2\pi q s) \qquad (6)$$

$$H_y(u, v, s) = i\,\frac{v}{q}\,\exp(-2\pi q s) . \qquad (7)$$
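
As a rough numerical check of the kernels (2)-(4) and of the property G(0, 0, s) = 1 implied by (5), the following Python sketch samples the kernels on an integer grid and sums g over a large window. The function names `poisson_kernels` and `kernel_sum`, the unit grid spacing, and the window sizes are illustrative assumptions and not part of the method described here.

```python
import math


def poisson_kernels(x, y, s):
    """Evaluate (g, h_x, h_y) from Eqs. (2)-(4) at the point (x, y) for scale s > 0."""
    denom = 2.0 * math.pi * (x * x + y * y + s * s) ** 1.5
    return s / denom, x / denom, y / denom


def kernel_sum(s, radius):
    """Sum g over the integer grid [-radius, radius]^2 (unit spacing is an assumption)."""
    total = 0.0
    for x in range(-radius, radius + 1):
        for y in range(-radius, radius + 1):
            total += poisson_kernels(x, y, s)[0]
    return total


if __name__ == "__main__":
    # The sum approaches 1 (the DC gain of g) as the window grows; the slow
    # approach reflects the heavy tails of the Poisson kernel.
    print(kernel_sum(2.0, 50), kernel_sum(2.0, 200))
```

The slowly decaying tails visible here are the spatial counterpart of the kink of $\exp(-2\pi q s)$ at q = 0.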

Besides, the Laplace equation (1) can also be solved by separating (x, y) and s, giving 
two factors: one is the kernel of the 2D Fourier transform $\exp(-i 2\pi (ux + vy))$ 
and the other one is $G(u, v, s)$, see [?]. Hence, instead of forming the scalar-valued 
scale-space by $g(x, y, s) * f(x, y)$, we build a vector field $\mathbf{f}$: 

$$\mathbf{f}(x, y, s) = \big(\, g(x, y, s) \;\; h_x(x, y, s) \;\; h_y(x, y, s) \,\big)^T * f(x, y) \qquad (8)$$

which is not only a family of monogenic signals (for different values of s) but also 
a solenoidal and irrotational field (which means that divergence and rotation 
are both zero), or, in terms of Geometric Algebra, a monogenic function [5] 
(monogenic: nD generalisation of holomorphic). 



3 A New Linear Scale-Space 

Now, moving from the mathematical background to interpretation, we first 
consider the shape of the derived lowpass filter g and the transfer function of 
the difference of two lowpass filters with different values of s, see Fig. 2. All four 




Fig. 2. Left: impulse response of lowpass filter g (solid line) and impulse response of 
Gaussian lowpass filter (dotted line). Middle: transfer function of the difference of two 
lowpass filters for different scales s and s' (solid line) and transfer function of the 
lognormal bandpass (dotted line). Right: lowpass filters applied to a step edge 



filters are isotropic and hence they are plotted with respect to the radial coordinate 
$r = \sqrt{x^2 + y^2}$ and the radial frequency $q$, respectively. The impulse response of 
g has a sharper peak than that of the Gaussian lowpass. On the other hand, it 
decreases more slowly with increasing r. Accordingly, steps in the image are changed 
according to Fig. 2 (right). While the curvature is smoother in the case of the 
filter g, the slope of the edge is more reduced by the Gaussian filter. Adopting 
the idea of the Laplacian pyramid [2], we use the difference of two lowpass filters 
to obtain a bandpass filter. It has a zero DC component and is similar to the 
lognormal bandpass, but it has a singularity in the first derivative for $q = 0$. 
Hence, low frequencies are not damped as much as in the lognormal case, but 
for high frequencies, the transfer function G decreases faster. 
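
The bandpass construction described above can be sketched in a few lines of Python. The sketch assumes the lowpass transfer function $\exp(-2\pi q s)$ from (5); the function names and the example scales are hypothetical. Setting the q-derivative of the difference to zero gives the peak frequency $q^* = \ln(s'/s) / (2\pi (s' - s))$, so the peak value depends only on the ratio s'/s.

```python
import math


def lowpass(q, s):
    """Transfer function of the lowpass filter at radial frequency q, cf. Eq. (5)."""
    return math.exp(-2.0 * math.pi * q * s)


def bandpass(q, s_fine, s_coarse):
    """Difference of two lowpass transfer functions, with s_fine < s_coarse."""
    return lowpass(q, s_fine) - lowpass(q, s_coarse)


def peak_frequency(s_fine, s_coarse):
    """Radial frequency at which the bandpass is maximal (zero of the q-derivative)."""
    return math.log(s_coarse / s_fine) / (2.0 * math.pi * (s_coarse - s_fine))


if __name__ == "__main__":
    for s_fine, s_coarse in [(1.0, 2.0), (4.0, 8.0)]:
        q_star = peak_frequency(s_fine, s_coarse)
        # Same scale ratio, hence the same peak height at a different frequency.
        print(q_star, bandpass(q_star, s_fine, s_coarse))
```

For a scale ratio of 2 the peak value is 1/2 - 1/4 = 1/4 regardless of the absolute scale, which illustrates why a geometric sequence of scales yields bandpass filters of equal maximal response.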

Considering the uncertainty relation, the lowpass filter g is slightly worse 
than the Gaussian lowpass (factor $\sqrt{1.5}$), which means that the localisation in 
phase space is not far from being optimal. In contrast to the lognormal bandpass 
filter, the spatial representation of the new bandpass filter is given analytically, 
so that it can be constructed in the spatial domain more easily. What is even 
more important, is that the linear combination of lowpass filtered monogenic 



² Note that this is not valid for the transfer functions (6) and (7). 

signals is again a monogenic signal. Therefore, the difference of two of these 
signals is a spherical quadrature filter (see also [?]). The last property of the new 
lowpass/bandpass filters which is important in the following is the energy. The 
DC component of the lowpass filter is one by definition (5). The bandpass filters 
for a multi-band decomposition are constructed according to 

$$B_k(u, v) = \exp\!\big(-2\pi \sqrt{u^2 + v^2}\, \lambda^{k}\big) - \exp\!\big(-2\pi \sqrt{u^2 + v^2}\, \lambda^{k-1}\big) \qquad (9)$$

for $\lambda \in (0; 1)$, which means that the maximal energy of the bandpass filter is 
independent of k. 



4 Scale Adaptive Filters 

The described scale-space approach can be used to denoise images or to reduce 
the number of features in an image (to sparsify it). Our algorithm works as 
follows: the scale-space is calculated up to a maximal smoothing $s_M$ which is 
sufficient to remove nearly all the noise. For each scale, the monogenic signal 
is calculated and an index function is defined by the finest scale with a local 
maximum of the local magnitude 

$$\mathrm{index}(x, y) = \begin{cases} s_m & \text{if } \exists \epsilon > 0 : |\mathbf{f}_M(x, y, s_m)| \geq |\mathbf{f}_M(x, y, s)| \;\; \forall s < s_m + \epsilon \\ & \quad \wedge\; s_m < s_M \;\wedge\; |\mathbf{f}_M(x, y, s_m)|^2 \geq E_t \\ s_M & \text{else} \end{cases} \qquad (10)$$

where the coarsest scale $s_M$ is chosen if the maximal energy is lower than a 
threshold $E_t$. Using this index function, a denoised image is constructed, where 
at each position the lowpass is chosen by the index. The algorithm is motivated 
by the fact that the response of the monogenic signal shows local maxima when 
it is applied at the ‘correct’ scale, where in general we have more than one 
maximum. Since it is not reasonable to consider the image on arbitrarily coarse 
scales (e.g. averaging of one quarter of the image), the range of scale is reduced 
and we mostly have only one maximum, see Fig. 3. The threshold of the energy at 
the maximum suppresses weak textures by setting the index function to maximal 
smoothing. Additionally, we apply median filters to the index functions, in order 
to denoise the latter. It is not reasonable to have an arbitrarily changing scale 
index since the scale depends on the local neighbourhood. 
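
The per-pixel scale selection can be sketched as follows. This is one plausible discrete reading of the index function, under stated assumptions: scales are given as a list ordered from fine to coarse, `energies[k]` is the local energy of the monogenic signal at scale k, a scale qualifies if its energy is at least as large as at all finer scales and at the next coarser scale and reaches the threshold, and the coarsest scale is the fallback. The names `scale_index` and `adaptive_lowpass` are hypothetical, and the median filtering of the index map is omitted.

```python
def scale_index(energies, threshold):
    """Return the index of the finest qualifying scale, else the coarsest index."""
    coarsest = len(energies) - 1
    running_max = float("-inf")
    for k in range(coarsest):  # the coarsest scale itself is only the fallback
        running_max = max(running_max, energies[k])
        dominates_finer = energies[k] >= running_max
        not_below_next = energies[k] >= energies[k + 1]
        if dominates_finer and not_below_next and energies[k] >= threshold:
            return k
    return coarsest


def adaptive_lowpass(lowpass_stack, energy_stack, threshold):
    """Pick, per pixel, the lowpass response at the selected scale index.

    Both stacks are lists over scales of equally sized 2D lists (rows of values).
    """
    rows, cols = len(lowpass_stack[0]), len(lowpass_stack[0][0])
    result = [[0.0] * cols for _ in range(rows)]
    index_map = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            energies = [level[r][c] for level in energy_stack]
            k = scale_index(energies, threshold)
            index_map[r][c] = k
            result[r][c] = lowpass_stack[k][r][c]
    return result, index_map
```

A pixel on a strong edge keeps a fine scale, while a pixel whose energy never reaches the threshold receives the maximally smoothed value.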

In our first experiment, we applied the algorithm to the test images in Fig. 3 
and Fig. 4. The images were both processed with the same threshold and the 
same maximal scale. In both cases, a 3 × 3 median filter was applied to the scale 
index obtained from (10). The dark shadows at the top and left border of the 
smoothed house image are artifacts resulting from performing the convolutions 
in the frequency domain. Obviously, textures with low energy are removed while 
edges and other structures with high energy are preserved. Whether a structure is 
preserved or not depends on the choice of the parameter $E_t$. In the case of 
noisy images, this energy threshold can be estimated according to a Rayleigh 
distribution (see [?]). 

Fig. 3. Test image (left). The energy of the monogenic signal at four positions 
(indicated in the left image) for varying scale can be found in the plot on the right 



In our second experiment, we added Gaussian noise to the image in Fig. 3 (see 
Fig. 5, left) and applied the standard method of isotropic Gaussian diffusion to it. 
The result can be found in the middle image in Fig. 5. The typical behaviour of 
Gaussian diffusion can be seen well: near the edges the noise is still present. Our 
approach results in the image on the right in the same figure. Except for some 
singular points the noise has been removed completely. The energy threshold 
was chosen by the method mentioned in the previous paragraph. In contrast to 
the diffusion algorithm, our method tends to over-smooth slightly those region 
boundaries where the energy of the edge is lower than the energy of the noise 
(see upper left corner). 

The algorithm can also be used to sparsify images. Instead of constructing 
the denoised image from lowpass filtered versions using the scale index, the filter 
response of the monogenic signal is kept for local maxima of the energy only. 
That means that for the ‘else’ case in (10) the result is set to zero (nothing 
is stored). The amount of the remaining information can be estimated by the 
dark areas in the index images in Fig. 4. The representation which is obtained 
includes information of the monogenic signal only at positions with high local 
energy at an appropriate scale. 



5 Conclusion 

We have presented a new approach to scale-space which is not based on the heat 
equation but on the 3D Laplace equation. The resulting lowpass and bandpass 
filters have been shown to work properly and a relation to spherical quadrature 
filters and the monogenic signal has been established. Using the energy of the 
monogenic signal for different scales, a scale adaptive denoising algorithm has 




Fig. 4. Upper row, left to right: scale index for first test image, second test image, 
corresponding scale index, relative difference between original and smoothed version. 
Below: smoothed images 



been presented which can also be adopted for a sparse representation of fea- 
tures. The results are comparable to those of isotropic Gaussian diffusion. The 
denoising based on the Laplace equation shows less noise in the neighbourhood 
of edges, but low-energy edges are not preserved as well as in the Gaussian case. 
The new approach can easily be extended to anisotropic denoising by introducing 
a 2 × 2 scaling matrix instead of the real scale parameter s. The behaviour 
for low-energy edges is expected to be better in that case. 

Abstract. Medial axis transform (MAT) is very sensitive to noise, 
in the sense that, even if a shape is perturbed only slightly, the Hausdorff 
distance between the MATs of the original shape and the perturbed one 
may be large. But it turns out that MAT is stable, if we view this phe- 
nomenon with the one-sided Hausdorff distance, rather than with the 
two-sided Hausdorff distance. In this paper, we show that, if the original 
domain is weakly injective, which means that the MAT of the domain 
has no end point which is the center of an inscribed circle osculating the 
boundary at only one point, the one-sided Hausdorff distance of the orig- 
inal domain’s MAT with respect to that of the perturbed one is bounded 
linearly with the Hausdorff distance of the perturbation. We also show 
by example that the linearity of this bound cannot be achieved for the 
domains which are not weakly injective. In particular, these results apply 
to the domains with the sharp corners, which were excluded in the past. 
One consequence of these results is that we can clarify theoretically the 
notion of extracting “the essential part of the MAT” , which is the heart 
of the existing pruning methods. 



1 Introduction 

The medial axis (MA) of a plane domain is defined as the set of the centers of 
the maximal inscribed circles contained in the given domain. The medial axis 
transform (MAT) is defined as the set of all the pairs of a medial axis point 
and the radius of the corresponding inscribed circle. Because of the additional 
radius information, the MAT can be used to reconstruct the original domain. More 
explicitly, the medial axis transform MAT($\Omega$) and the medial axis MA($\Omega$) of 
a plane domain $\Omega$ are defined by 

$$\mathrm{MAT}(\Omega) = \{\, (p, r) \in \mathbb{R}^2 \times [0, \infty) \;|\; B_r(p) : \text{maximal ball in } \Omega \,\}, \qquad \mathrm{MA}(\Omega) = \{\, p \in \mathbb{R}^2 \;|\; \exists r \geq 0 \text{ s.t. } (p, r) \in \mathrm{MAT}(\Omega) \,\}.$$

Medial axis (transform) is one of the most widely-used tools in shape analysis. 
It has a natural definition, and has a graph structure which preserves the original 
shape homotopically [?]. But the medial axis transform has one weak point: 
it is not stable under the perturbation of the domain [?]. See Figure 6. 
Even when the domain in (b) is slightly perturbed to the domain in (a) (that 
is, the Hausdorff distance between the domains in (a) and (b) is small), the 
MAT (MA) changes drastically, which results in a large value of the Hausdorff 
distance between the MATs (MAs) of the domains in (a) and (b). 

This seemingly implausible phenomenon can produce a lot of problems, 
especially in the recognition fields, since the data representing the domains have 
inevitable noise. So there have been many attempts to reduce the complexity of 
the MAT by “pruning” out what is considered less important, or considered to 
be caused by the noise [?]. 

One important observation that can be made from Figure 6 is that the MAT 
(MA) in (b) is contained approximately in the MAT (MA) in (a). In other 
words, although the two-sided Hausdorff distance of the MATs in (a) and (b) is 
large, the one-sided Hausdorff distance of the MAT in (b) with respect to that 
in (a) is still small. 

In this paper, we analyze this phenomenon, and show that MA and MAT 
are indeed stable, if we measure the change by the one-sided Hausdorff 
distance instead of the two-sided Hausdorff distance. We will show that, when a 
plane domain $\Omega$ satisfies a certain smoothness condition which we call 
weak injectivity, then the one-sided Hausdorff distance of MA($\Omega$) (resp., MAT($\Omega$)) 
with respect to MA($\Omega'$) (resp., MAT($\Omega'$)) has an upper bound which is linear 
in the Hausdorff distances between $\Omega$, $\Omega'$ and between $\partial\Omega$, $\partial\Omega'$ for an 
arbitrary domain $\Omega'$. In particular, weak injectivity is shown to be essential for 
having the linear bound. This result extends the previous one for injective 
domains [?]. We can now allow sharp corners in the domains for which the 
linear one-sided stability is valid. 

It turns out that the coefficient of the linear bound grows as the angle $\theta_\Omega$ 
(see Section 2) characteristic to a weakly injective domain $\Omega$ decreases. An 
important consequence of this is that we can approximately measure the degree 
of the “detailed-ness” of a domain $\Omega$ by the value $\theta_\Omega$. Along with this, we will 
discuss the relation between our result and the pruning of MAT. 



2 Preliminaries 

2.1 Normal Domains 

Contrary to the common belief, MAT($\Omega$) and MA($\Omega$) may not be graphs with 
finite structure, unless the original domain $\Omega$ satisfies the following rather strong 
condition [?]: $\Omega$ is compact, or equivalently, $\Omega$ is closed and bounded, and the 
boundary $\partial\Omega$ of $\Omega$ is a (disjoint) union of a finite number of simple closed curves, 
each of which in turn consists of a finite number of real-analytic curve pieces. So 
we will consider only the domains satisfying this condition, which we call normal. 

Let $\Omega$ be a normal domain. Then, except for some finite number of special 
points, the maximal ball $B_r(p)$ for every $P = (p, r) \in \mathrm{MAT}(\Omega)$ has exactly two 
contact points with the boundary $\partial\Omega$. It is well known that MA($\Omega$) is a curve 
around such p in $\mathbb{R}^2$. See Figure 1. We will denote the set of all such generic 
points in MA($\Omega$) by $G(\Omega)$, and, for every $p \in G(\Omega)$, define $0 < \theta(p) \leq \frac{\pi}{2}$ to be 
the angle between $\overline{pq_1}$ (or equivalently $\overline{pq_2}$) and MA($\Omega$) at p, where $q_1$, $q_2$ are 
the two contact points. 

Fig. 1. Local geometry of MA around a generic point 



Now, for every normal domain $\Omega$, we define $\theta_\Omega = \inf \{\theta(p) : p \in G(\Omega)\}$. Note 
that $0 \leq \theta_\Omega \leq \frac{\pi}{2}$. We also define $\rho_\Omega = \min \{r : (p, r) \in \mathrm{MAT}(\Omega)\}$, that is, $\rho_\Omega$ 
is the smallest radius of the maximal balls contained in $\Omega$. 

We call an end point of MA a 1-prong point. There are exactly three kinds of 
1-prong points in MA, which are depicted in Figure 2. Type (a) is the center 
of a maximal circle with only one contact point at which the circle osculates the 
boundary. Type (b) is a sharp corner. Type (c) is a 1-prong point with a contact 
arc. It is easy to see that $\theta_\Omega = 0$ if and only if MA($\Omega$) has a 1-prong point of 
type (a), and $\rho_\Omega = 0$ if and only if MA($\Omega$) has a 1-prong point of type 
(b). 




We call a normal domain $\Omega$ injective if $\theta_\Omega > 0$ and $\rho_\Omega > 0$, and weakly 
injective if $\theta_\Omega > 0$. Thus, $\Omega$ is injective if and only if every end point of 
MA($\Omega$) is of type (c), and it is weakly injective if and only if MA($\Omega$) does 
not have end points of type (a). Note that a weakly injective domain 
may have a sharp corner (i.e., type (b)), while an injective domain may not. 
For more details on the properties of the medial axis transform, see [?]. 

2.2 Hausdorff Distance: Euclidean vs. Hyperbolic 

Although sometimes it might be misleading [?], the Hausdorff distance is a 
natural device to measure the difference between two shapes. Let A and B be two 
(compact) sets in $\mathbb{R}^2$. The one-sided Hausdorff distance of A with respect to B, 
$\mathcal{H}(A|B)$, is defined by $\mathcal{H}(A|B) = \max_{p \in A} d(p, B)$, where $d(\cdot, \cdot)$ is the usual 
Euclidean distance. The (two-sided) Hausdorff distance between A and B, $\mathcal{H}(A, B)$, 
is defined by $\mathcal{H}(A, B) = \max \{\mathcal{H}(A|B), \mathcal{H}(B|A)\}$. Note that, whereas the 
two-sided Hausdorff distance measures the difference between two sets, the one-sided 
Hausdorff distance measures how approximately one set is contained in another 
set. 
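
For finite point sets, the two definitions translate directly into code. The following Python sketch is a brute-force illustration only; representing the compact sets A and B by finite samples, and the function names, are assumptions made here.

```python
import math


def one_sided_hausdorff(A, B):
    """H(A|B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Two-sided Hausdorff distance H(A, B) = max{H(A|B), H(B|A)}."""
    return max(one_sided_hausdorff(A, B), one_sided_hausdorff(B, A))


if __name__ == "__main__":
    A = [(0.0, 0.0), (1.0, 0.0)]
    B = [(0.0, 0.0), (1.0, 0.0), (1.0, 5.0)]  # A plus one far-away outlier
    print(one_sided_hausdorff(A, B))  # 0.0: A is contained in B
    print(one_sided_hausdorff(B, A))  # 5.0: the outlier is far from A
    print(hausdorff(A, B))            # 5.0
```

The example mirrors the situation studied in this paper: one set is contained in the other, so one of the one-sided distances vanishes while the two-sided distance is large.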

Though the Hausdorff distance is intuitively appealing, it cannot capture well 
the seemingly unstable behaviour of MAT under the perturbation. Recently, 
a new measure called the hyperbolic Hausdorff distance has been introduced, 
so that MAT (and MA) becomes stable if the difference between two 
MATs is measured by this measure [?] (see Proposition 1 below). 

Let $P_1 = (p_1, r_1)$, $P_2 = (p_2, r_2)$ be in $\mathbb{R}^2 \times \mathbb{R}_{\geq 0}$, where we denote 
$\mathbb{R}_{\geq 0} = \{x \in \mathbb{R} \;|\; x \geq 0\}$. Then the hyperbolic distance $d_h(P_1|P_2)$ from $P_1$ to $P_2$ is defined by 
$d_h(P_1|P_2) = \max \{0, d(p_1, p_2) - (r_2 - r_1)\}$. Let $M_1$, $M_2$ be compact sets in 
$\mathbb{R}^2 \times \mathbb{R}_{\geq 0}$. Then the one-sided hyperbolic Hausdorff distance $\mathcal{H}_h(M_1|M_2)$ of $M_1$ with 
respect to $M_2$ is defined by $\mathcal{H}_h(M_1|M_2) = \max_{P_1 \in M_1} \{\min_{P_2 \in M_2} d_h(P_1|P_2)\}$, and 
the (two-sided) hyperbolic Hausdorff distance between $M_1$ and $M_2$ is defined by 

$$\mathcal{H}_h(M_1, M_2) = \max \{\mathcal{H}_h(M_1|M_2), \mathcal{H}_h(M_2|M_1)\}.$$
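
The hyperbolic distance is not symmetric, which a small example makes concrete. The following Python sketch represents each MAT point as a pair (center, radius) and each set as a finite list; this finite representation and the function names are assumptions for illustration. Note that $d_h(P_1|P_2) = 0$ exactly when the closed ball around $p_1$ with radius $r_1$ is contained in the closed ball around $p_2$ with radius $r_2$.

```python
import math


def hyperbolic_distance(P1, P2):
    """d_h(P1|P2) = max{0, d(p1, p2) - (r2 - r1)} for P = (center, radius)."""
    (p1, r1), (p2, r2) = P1, P2
    return max(0.0, math.dist(p1, p2) - (r2 - r1))


def one_sided_hyperbolic_hausdorff(M1, M2):
    """H_h(M1|M2): worst case over M1 of the best match in M2."""
    return max(min(hyperbolic_distance(P1, P2) for P2 in M2) for P1 in M1)


def hyperbolic_hausdorff(M1, M2):
    """Two-sided hyperbolic Hausdorff distance."""
    return max(one_sided_hyperbolic_hausdorff(M1, M2),
               one_sided_hyperbolic_hausdorff(M2, M1))


if __name__ == "__main__":
    small = ((0.0, 0.0), 1.0)  # unit ball at the origin
    large = ((0.0, 0.0), 3.0)  # concentric ball of radius 3
    print(hyperbolic_distance(small, large))  # 0.0: the small ball fits inside
    print(hyperbolic_distance(large, small))  # 2.0: the large ball sticks out by 2
```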

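Now we have the following result which plays an important role in showing 
the main result (Theorem 1) of this paper. 

Proposition 1. ([?]) For any normal domains $\Omega_1$ and $\Omega_2$, we have 

$$\max \{\mathcal{H}(\Omega_1, \Omega_2), \mathcal{H}(\partial\Omega_1, \partial\Omega_2)\} \leq \mathcal{H}_h(\mathrm{MAT}(\Omega_1), \mathrm{MAT}(\Omega_2)),$$
$$\mathcal{H}_h(\mathrm{MAT}(\Omega_1), \mathrm{MAT}(\Omega_2)) \leq 3 \cdot \max \{\mathcal{H}(\Omega_1, \Omega_2), \mathcal{H}(\partial\Omega_1, \partial\Omega_2)\}.$$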



3 Perturbation of Weakly Injective Domain 

We first review the previous result for the injective domains. 

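Proposition 2. (Infinitesimal Perturbation of Injective Domain, [?]) 
Let $\Omega$ be an injective domain. Then we have 

$$\mathcal{H}(\mathrm{MA}(\Omega)|\mathrm{MA}(\Omega')) \leq \frac{[?]}{1 - \cos\theta_\Omega} \cdot \epsilon + o(\epsilon),$$
$$\mathcal{H}(\mathrm{MAT}(\Omega)|\mathrm{MAT}(\Omega')) \leq \frac{[?]}{1 - \cos\theta_\Omega} \cdot \epsilon + o(\epsilon),$$

for every $\epsilon > 0$ and normal domain $\Omega'$ with $\max \{\mathcal{H}(\Omega, \Omega'), \mathcal{H}(\partial\Omega, \partial\Omega')\} \leq \epsilon$. 

We show that, infinitesimally, the one-sided Hausdorff distance of MAT (and 
MA) of a weakly injective domain is bounded linearly by the magnitude of the 
perturbation. Define a function $g : (0, \pi/2] \to \mathbb{R}$ by $g(\theta) = 3 \left( 1 + \frac{2 \sqrt{1 + \cos^2\theta}}{1 - \cos\theta} \right)$. 

Theorem 1. (Infinitesimal Perturbation of Weakly Injective Domain) 
Let $\Omega$ be a weakly injective domain. Then we have 

$$\mathcal{H}(\mathrm{MAT}(\Omega)|\mathrm{MAT}(\Omega')),\; \mathcal{H}(\mathrm{MA}(\Omega)|\mathrm{MA}(\Omega')) \leq g(\theta_\Omega) \cdot \epsilon + o(\epsilon),$$

for every $\epsilon > 0$ and normal domain $\Omega'$ with $\max \{\mathcal{H}(\Omega, \Omega'), \mathcal{H}(\partial\Omega, \partial\Omega')\} \leq \epsilon$. 

Proof. See [?]. 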

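Example 1. Let $\Omega$ be a weakly injective domain with a sharp corner $P_1$, 
depicted as in Figure 3. Let $\Omega'$ be the domain obtained by smoothing $\Omega$ near 
$P_1$ so that MAT($\Omega'$) = $\overline{P_2 P_3}$. Let $P_i = (p_i, r_i)$ for i = 1, 2, 3. Note that 
$\mathcal{H}(\Omega, \Omega') = \mathcal{H}(\partial\Omega, \partial\Omega') = \epsilon$, and 

$$\mathcal{H}(\mathrm{MA}(\Omega)|\mathrm{MA}(\Omega')) = d(p_1, p_2) = \frac{1}{1 - \cos\theta_\Omega} \cdot \epsilon, \qquad \mathcal{H}(\mathrm{MAT}(\Omega)|\mathrm{MAT}(\Omega')) = d(P_1, P_2) = \frac{\sqrt{1 + \cos^2\theta_\Omega}}{1 - \cos\theta_\Omega} \cdot \epsilon.$$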




Fig. 3. One-sided stability for weakly injective domain 

This example shows that the factor $\frac{\sqrt{1 + \cos^2\theta_\Omega}}{1 - \cos\theta_\Omega}$ in $g(\theta_\Omega)$, which blows up as 
$\theta_\Omega \to 0$, is indeed unavoidable. One important consequence is that the class 
of the weakly injective domains is the largest possible class for which we have 
a linear bound for the one-sided Hausdorff distance of MAT (and MA) with 
respect to the perturbation. 
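
The distances in Example 1 can be checked by elementary geometry. The following derivation is only a sketch under stated assumptions: the corner is smoothed by a circular arc of radius $r_2$ centered at $p_2$ on the bisector of the corner and tangent to both sides, $r_1 = 0$ at the corner, $\theta_\Omega$ is attained along this bisector, and $\alpha$ (a symbol introduced here) denotes the half-angle of the corner.

```latex
\begin{align*}
\theta_\Omega &= \tfrac{\pi}{2} - \alpha, \qquad
d(p_1, p_2) = \frac{r_2}{\sin\alpha} = \frac{r_2}{\cos\theta_\Omega}, \\
\epsilon &= d(p_1, p_2) - r_2 = r_2\,\frac{1 - \cos\theta_\Omega}{\cos\theta_\Omega}
  \;\Longrightarrow\; d(p_1, p_2) = \frac{\epsilon}{1 - \cos\theta_\Omega}, \\
d(P_1, P_2) &= \sqrt{d(p_1, p_2)^2 + (r_2 - r_1)^2}
  = r_2\,\frac{\sqrt{1 + \cos^2\theta_\Omega}}{\cos\theta_\Omega}
  = \frac{\sqrt{1 + \cos^2\theta_\Omega}}{1 - \cos\theta_\Omega}\,\epsilon .
\end{align*}
```

Under these assumptions both distances grow like $1/(1 - \cos\theta_\Omega)$ as the corner becomes flatter, which matches the statement that the blow-up cannot be avoided.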

4 Illustrating Examples 

Now we will consider a few examples, and calculate explicitly the constants $\theta_\Omega$ 
and $g(\theta_\Omega)$ for each of them. 

Example 2. Consider an equilateral triangle and a star-shaped domain depicted 
respectively as in Figure 4 (a) and (b). Note that $\theta_\Omega = \frac{\pi}{3}$ for (a), and $\theta_\Omega = \frac{\pi}{5}$ for 
(b). So $g(\theta_\Omega) = 3(1 + 2\sqrt{5}) = 16.416408\ldots$ for (a), and $g(\theta_\Omega) = 43.410203\ldots$ 
for (b). 

Example 3. Consider the rectangular domains with the constant widths depicted 
as in Figure 5. Note that $\theta_\Omega = \frac{\pi}{4}$, and hence $g(\theta_\Omega) = 3\left(1 + 2\sqrt{3}\,(1 + \sqrt{2})\right) = 28.089243\ldots$ 
for all cases. 

Fig. 4. (a) Equilateral triangle; $\theta_\Omega = \frac{\pi}{3}$ and $g(\theta_\Omega) = 16.416408\ldots$; (b) five-sided star; 
$\theta_\Omega = \frac{\pi}{5}$ and $g(\theta_\Omega) = 43.410203\ldots$. 




Fig. 5. Rectangular domains with the constant widths; for all cases, we have $\theta_\Omega = \frac{\pi}{4}$ 
and $g(\theta_\Omega) = 28.089243\ldots$. 
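
The constants quoted in Examples 2 and 3 can be reproduced numerically. The Python sketch below assumes the closed form $g(\theta) = 3\,(1 + 2\sqrt{1 + \cos^2\theta}/(1 - \cos\theta))$ and the angles $\pi/3$, $\pi/5$ and $\pi/4$; both are reconstructions that agree with the quoted values $3(1 + 2\sqrt{5})$, 43.410203 and 28.089243, and should be read as assumptions rather than as the authors' stated formula.

```python
import math


def g(theta):
    """Coefficient of the linear bound for an angle 0 < theta <= pi/2 (assumed form)."""
    c = math.cos(theta)
    return 3.0 * (1.0 + 2.0 * math.sqrt(1.0 + c * c) / (1.0 - c))


if __name__ == "__main__":
    for name, theta in [("equilateral triangle", math.pi / 3),
                        ("five-sided star", math.pi / 5),
                        ("rectangle", math.pi / 4)]:
        print(f"{name}: g = {g(theta):.6f}")
```

The coefficient grows quickly as the angle shrinks: halving the angle from $\pi/2$ (where g = 9) to $\pi/4$ already triples it.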



5 The Essential Part of the MAT: Relation to Pruning 

Theorem 1 together with Example 1 says that the angle $\theta_\Omega$ is an important 
quantity reflecting the degree of the “detailed-ness” of a domain $\Omega$. The smaller $\theta_\Omega$ 
becomes, the finer the approximation, that is, the smaller $\max \{\mathcal{H}(\Omega, \Omega'), \mathcal{H}(\partial\Omega, \partial\Omega')\}$ 
that is needed for MAT($\Omega'$) and MA($\Omega'$) of another domain $\Omega'$ to contain 
(approximately) MAT($\Omega$) and MA($\Omega$) respectively. 

Suppose we perturb a weakly injective domain with domains which are also 
weakly injective. In this case, MAT and MA become stable under the “two- 
sided” Hausdorff distance. In particular, we have the following corollary: 

Corollary 1. (Approximation by Weakly Injective Domains) Let $\Omega$ be 
a normal domain, and let $\Omega_1$ and $\Omega_2$ be two weakly injective domains such that 
$\max \{\mathcal{H}(\Omega_i, \Omega), \mathcal{H}(\partial\Omega_i, \partial\Omega)\} \leq \epsilon$ for i = 1, 2. Let $\theta = \min \{\theta_{\Omega_1}, \theta_{\Omega_2}\}$. Then we 
have 

$$\mathcal{H}(\mathrm{MAT}(\Omega_1), \mathrm{MAT}(\Omega_2)),\; \mathcal{H}(\mathrm{MA}(\Omega_1), \mathrm{MA}(\Omega_2)) \leq 2 g(\theta) \cdot \epsilon + o(\epsilon).$$

Thus, the effects on MAT and MA which arise from the choice of the weakly 
injective domains to approximate a normal domain are relatively small. So the 
MAT and the MA of an approximating weakly injective domain may be 
considered as a common part among all the other approximations with the same 
$\theta_\Omega$, and hence, an essential part of the original MAT and MA with the fine 
details determined by the value of $\theta_\Omega$. This suggests that, by approximating a 
given normal domain with weakly injective domains, it is possible to extract 
approximately the most essential part of the original MAT and MA, which is 
the main objective of the existing pruning methods. 

For example, let $\Omega'$ be the original domain as shown in Figure 6 (a), whose 
MA has many unilluminating parts due to its noisy boundary. We approximate 
$\Omega'$ by a weakly injective domain $\Omega$ shown in Figure 6 (b), which has a relatively 
simpler MA. 







Fig. 6. Pruning MAT: (a) The original normal domain $\Omega'$ with its MA; (b) 
The approximating weakly injective domain $\Omega$ with its MA. Note that sharp 
corners are allowed; (c) The Hausdorff distance between $\Omega$ and $\Omega'$. Here, 
$\epsilon = \max \{\mathcal{H}(\Omega, \Omega'), \mathcal{H}(\partial\Omega, \partial\Omega')\}$; (d) Comparison of MA($\Omega$) and MA($\Omega'$). Note that 
MA($\Omega$) captures an essential part of MA($\Omega'$), while simplifying MA($\Omega'$). 



By Theorem 1, we can get a bound on how approximately MA($\Omega$) is 
contained in MA($\Omega'$), or how faithfully MA($\Omega$) approximates parts of MA($\Omega'$), 
from the constant $\theta_\Omega$. Moreover, from Corollary 1, we see that MA($\Omega$) is the 
essential part of MA($\Omega'$) up to the bound in Corollary 1. Overall, by computing 
the much simpler MA($\Omega$), we can get the essential part (within the bounds) 
of MA($\Omega'$) without ever computing MA($\Omega'$) at all. See [?] for the computation 
of MAT for domains with free-form boundaries. 

Of course, there still remains the problem of how to approximate/smooth 
the original noisy boundary. But we claim that, whatever method is used, our 
bounds can serve as a theoretical guarantee of the correctness of the approxima- 
tion/pruning, which is absent in most of the existing pruning schemes. 

One notable advance from [?] is that we can now allow sharp corners 
for the approximating domain. For the discussion of using the general normal 
domains for the approximation, see [?]. 

Abstract. In order to automatically acquire information recorded 
in church registers and other historical scriptures, the text of such 
documents needs to be segmented prior to automatic reading. Segmentation 
of old handwritten scriptures is difficult for two main reasons: 
lines of text in general are not straight, and ascenders and descenders of 
adjacent lines interfere. The algorithms described in this paper provide 
ways to reconstruct the path of the lines of text using an approach 
of gradually constructing line segments until a unique line of text is 
formed. The method was applied to church registers. They were written 
by different people in different styles between the 17th and 19th 
century. Line segmentation was found to be successful in 95% of all samples. 

Keywords: Handwriting recognition, text line detection, document im- 
age processing 



1 Introduction 

Many historical documents that exist in libraries and various archives could be 
exploited electronically. Among those documents, church registers contain infor- 
mation on birth, marriage and death of the local population. Digital processing 
of such data requires automatic reading of this text. We present an approach for 
automatic line detection which is an important preprocessing step of this. 

Most line detection methods assume that the image of the text does not 
change much and that lines are well separated [?]. Church registers, however, 
were written with lines of text being close to each other, with type, size and 
shape of the handwriting changing between different registers and with text 
of a given line possibly reaching into adjoining lines. Even methods that deal 
with line segmentation of such unconstrained data [2] often require separation 
of the text with sufficient space between lines [?], straightness of the lines [?] 
or only a single line [?]. As none of these constraints are given in our case, other 
knowledge needs to be exploited for giving evidence of text lines. We followed 
a strategy that is independent of gaps between lines of text and straightness of 
text lines. Evidence stems from local minima defining locally straight baselines. 
Local straightness is ascertained from gradually creating larger and larger line 
segments and removing those segments that do not follow horizontal and vertical 

Fig. 1. Part of an entry in a church register of the county of Wegenstedt in 1812 and 
the reconstructed baselines and centre lines. 



continuity constraints. The algorithm reconstructs baselines and centre lines in 
cases such as shown in Fig. 1 from a chaincode representation of the skeleton 
which is derived from the text in a preprocessing stage using modules from the 
VECTORY Software [?]. 

The algorithm was tested on texts from church registers of the county of 
Wegenstedt that span a range of 300 years. 



2 Line Detection 

Four different “ledger” lines, the ascender line, the descender line, the baseline 
and the centre line bound the characters in a line of text. The baseline is the 
most pronounced line in the church registers. The location of the centre line 
can be deduced from the baseline and corrected based on evidence from the 
pen strokes. The ascender line and the descender line are not well pronounced 
because ascenders and descenders occur too infrequently and their height has a 
large variance. Thus, we restrict our search to that of the baseline and the centre 
line. 

The set of connected chaincode elements forms foreground objects which are 
called continua. The baseline is found based on the local minima of all continua 
in y-direction. They are assumed to be locally straight even though lines of 
text curve over the complete width of the page. Local minima indicate, for the 
most part, points on the baseline and on the descender line with the majority 
stemming from baseline minima. Thus, the only line stretching over the whole 
width of the page and being made up of local minima from continua that are close 
enough together and locally straight, should be the baseline. To a lesser extent, 
the same argument holds for finding the centre line based on the local maxima of 
the chaincode of the continua. Finding baseline and centre line is carried out in 
four steps: (1) Potential baseline segments (pBLSs) are found that are segments 
of straight lines through local minima of the chaincode. (2) Baseline segments 
(BLSs) are selected or constructed from the pBLSs. (3) Baselines are created 
by joining BLSs which represent the same baseline. (4) Centre lines are created 
based on the local maxima of the chaincode and on the assumption that they 
run approximately in parallel to adjacent baselines. 

The robustness of the parameter settings used for the processes described 
below is reported in Section 3. In the following, hsc stands for the average height 
of small characters (distance between baseline and centre line) and Wc for the 
average width of a character. 



2.1 Detection of Potential Baseline Segments (pBLSs) 

The pBLSs are created from local minima of all continua on the page. Local 
minimum vertices are marked. A pBLS consists of a direction $\alpha$ and an 
ordered list of at least four vertices. Let $d_{x,\max} = 3.4 \cdot W_c$ and 
$d_{y,\max} = 0.2 \cdot h_{sc}$. Adjacent vertices in this list must not be further 
apart than $d_{x,\max}$. None of the vertices may deviate by more than $d_{y,\max}$ 
from the straight line connecting these vertices and defined by the direction $\alpha$. 

pBLSs are created independently for each marked vertex and for each direction 
$\alpha$ at increments of 1° within ±20° (we found this range to be sufficient). 
New vertices are added that lie in direction $\alpha$, constrained by the 
above-mentioned distance and deviation tolerances. The search for a pBLS 
terminates when no new vertices can be added. 
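
The growth of a single pBLS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `grow_pbls`, the tuple representation of vertices, growing only to the right of the start vertex, and measuring the deviation against the line through the start vertex are all assumptions made for illustration.

```python
import math

def grow_pbls(start, vertices, alpha_deg, dx_max, dy_max):
    """Grow one potential baseline segment from `start` in direction alpha.

    start:    (x, y) local minimum used as the first vertex (assumption)
    vertices: iterable of (x, y) local minima on the page
    Returns the ordered vertex list, or None if fewer than four vertices fit.
    """
    slope = math.tan(math.radians(alpha_deg))
    segment = [start]
    # Visit candidates from left to right, starting to the right of `start`.
    for v in sorted(p for p in vertices if p[0] > start[0]):
        last = segment[-1]
        if v[0] - last[0] > dx_max:
            break  # gap too wide; all later candidates are even further away
        # Vertical deviation from the line through `start` with direction alpha.
        expected_y = start[1] + slope * (v[0] - start[0])
        if abs(v[1] - expected_y) <= dy_max:
            segment.append(v)
    # A pBLS needs at least four vertices.
    return segment if len(segment) >= 4 else None
```

In a full search this function would be called for every marked vertex and for every direction between -20° and +20° in steps of 1°.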

2.2 Creating Baseline Segments (BLSs) 

After finding all possible pBLSs, each vertex may belong to more than one pBLS. 
First, pBLSs are excluded that deviate by more than ±7° from the main direction 
of all pBLSs. This direction is estimated from the histogram of directions of all 
pBLSs. The threshold of ±7° was found experimentally. 
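
As a rough sketch of this filtering step, the main direction can be taken as the most frequent entry of the direction histogram. Using the histogram mode is an assumption (the text only says the direction is estimated from the histogram), and the function name `filter_by_main_direction` is hypothetical.

```python
from collections import Counter

def filter_by_main_direction(pbls_angles, tolerance_deg=7):
    """Return the main direction and the indices of the pBLSs that are kept.

    pbls_angles: integer directions in degrees, one per pBLS (1-degree bins).
    """
    histogram = Counter(pbls_angles)
    # Assumption: the main direction is the mode of the histogram.
    main_direction = histogram.most_common(1)[0][0]
    kept = [i for i, angle in enumerate(pbls_angles)
            if abs(angle - main_direction) <= tolerance_deg]
    return main_direction, kept
```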

The remaining set of pBLSs still contains wrong segments. The next step 
creates a subset of baseline segments (BLSs) from the set of pBLSs. BLSs are 
selected according to the following rules: 

— The number of strokes above the BLS must be larger than that below it. 

— The BLS must not be completely contained in another pBLS in an adjacent 
direction with a smaller vertical deviation of the included vertices. 

— The BLS must not be intersected by a longer pBLS that includes it in hori- 
zontal direction. 

A special case arises if two pBLSs cover the same part of the baseline but with 
conflicting interpretations. This is the case if two crossing pBLSs have the 
vertex lists $p_1 = \{v_{1,1}, \ldots, v_{\min}, \ldots, v_{\max}, \ldots, v_{1,n}\}$ and 
$p_2 = \{v_{2,1}, \ldots, v_{\min}, \ldots, v_{\max}, \ldots, v_{2,m}\}$ (shown in Fig. 2). 

Such line segments are separated into three subsets. The middle part is the 
set of vertices $\{v_{\min}, \ldots, v_{\max}\}$ that is contained in $p_1$ as well as 
in $p_2$. One subset on the left side and one subset on the right side are chosen. 
To decide between the subsets $\{v_{1,1}, \ldots, v_{\min}\}$ and 
$\{v_{2,1}, \ldots, v_{\min}\}$, and between the subsets $\{v_{\max}, \ldots, v_{1,n}\}$ 
and $\{v_{\max}, \ldots, v_{2,m}\}$, the one that contains the larger number of 
vertices is chosen. 

Fig. 2. Solving a special case of crossing BLSs. 



2.3 Creating Baselines 

Elements of the set of BLSs are joined in order to form baselines. The process 
starts with the leftmost and uppermost BLS that is not part of a baseline and 
attempts to create a baseline by adding the next BLS. For this step, let 
$d_{x,\max} = 7 \cdot W_c$ and $d_{y,\max} = 1.6 \cdot h_{sc}$. The join is possible 
if the leftmost vertex of the next BLS is not further away than $d_{x,\max}$ from 
the rightmost vertex of the current BLS and if the vertical distance is less than 
$d_{y,\max}$. The process proceeds until no more BLSs can be added. It is repeated 
for new baselines until no BLS exists that is not part of a baseline. 
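
The joining step can be sketched as a greedy chaining procedure. The representation of a BLS as a list of (x, y) vertices sorted by x, the requirement that the next BLS starts to the right of the current one, and the name `join_bls` are assumptions for illustration only.

```python
def join_bls(bls_list, dx_max, dy_max):
    """Greedily chain baseline segments into baselines.

    bls_list: list of segments, each a list of (x, y) vertices sorted by x.
    Image coordinates are assumed, so a smaller y is higher on the page.
    """
    # Process segments starting with the leftmost (and then uppermost) one.
    remaining = sorted(bls_list, key=lambda s: (s[0][0], s[0][1]))
    baselines = []
    while remaining:
        baseline = [remaining.pop(0)]
        extended = True
        while extended:
            extended = False
            right_x, right_y = baseline[-1][-1]
            for idx, cand in enumerate(remaining):
                left_x, left_y = cand[0]
                gap = left_x - right_x
                # Assumption: the next BLS must start to the right of the current one.
                if 0 <= gap <= dx_max and abs(left_y - right_y) < dy_max:
                    baseline.append(remaining.pop(idx))
                    extended = True
                    break
        baselines.append(baseline)
    return baselines
```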

It may happen that false baselines are created from combining local minima 
lying off the baseline, from artefacts caused by ascenders and descenders, and 
even from true local minima at the baseline (the latter because a tolerance of ±7° is 
still large). For the removal of false baselines, a quality measure is computed from 
the total length of the baseline $l_{total}$ and the percentage of strokes $s_{above}$ 
immediately above the baseline as compared to all strokes: 
$qu = l_{total} \cdot s_{above}$. If two potential baselines intersect, the one with 
the lower $qu$ value is deleted. 

Wrong baselines may also be located between lines of text. They are generally 
shorter than true baselines and can be detected on the basis of these facts. Let 
$l_{av}$ be the average length of all baselines. The average vertical distance 
$d_{av}$ between adjacent baselines is computed from all baselines with 
$l_{total} > 0.8 \cdot l_{av}$, which excludes wrong baselines from the computation. 
Let $d_{prev}$ be the distance of a baseline to the previous one and $d_{next}$ the 
distance to the next one. Baselines with $\min(d_{prev}, d_{next}) < 0.6 \cdot d_{av}$ 
are removed. 
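
The two removal criteria can be sketched as follows. Several points are assumptions rather than statements from the text: baselines are reduced to a vertical position and a total length, the average spacing is computed from baselines longer than 0.8 times the average length, and only the shorter baselines are candidates for removal (otherwise the true neighbours of an interline baseline would also violate the distance test). The function names are hypothetical.

```python
def quality(total_length, strokes_above, strokes_total):
    """qu = l_total * s_above, with s_above the fraction of strokes directly above."""
    return total_length * (strokes_above / strokes_total)

def remove_interline_baselines(baselines, length_factor=0.8, spacing_factor=0.6):
    """baselines: non-empty list of (y, total_length) tuples sorted top to bottom."""
    l_av = sum(length for _, length in baselines) / len(baselines)
    is_long = [length > length_factor * l_av for _, length in baselines]
    long_ys = [y for (y, _), keep in zip(baselines, is_long) if keep]
    if len(long_ys) < 2:
        return list(baselines)  # the spacing cannot be estimated
    # Mean gap between adjacent long baselines.
    d_av = (long_ys[-1] - long_ys[0]) / (len(long_ys) - 1)
    kept = []
    for i, (y, length) in enumerate(baselines):
        gaps = []
        if i > 0:
            gaps.append(y - baselines[i - 1][0])      # d_prev
        if i + 1 < len(baselines):
            gaps.append(baselines[i + 1][0] - y)      # d_next
        too_close = min(gaps) < spacing_factor * d_av
        # Assumption: only short baselines are removed by the distance test.
        if is_long[i] or not too_close:
            kept.append((y, length))
    return kept
```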

2.4 Computing the Centre Line 

The centre line could be computed from local maxima of the skeleton of the 
continua in a similar fashion as the baseline, but the local maxima give much weaker 
support for the line segments. However, the path of the baseline is known. Given 
the assumption that the centre line is approximately parallel to the baseline and 
the assumption that the distance between centre line and baseline is less 
than 50% of the distance between two adjacent baselines, the centre line can be 
reconstructed. 

Based on the position and direction of a BLS, a centre line segment (CLS) is 
found by calculating the average vertical offset of the local maximum vertices 
situated above the BLS and below the previous baseline (see Fig. 3). The 
horizontal distance constraint is not used because the existence of this CLS is 
known and only its position is sought. 
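
A small sketch of the offset computation is given below. It assumes image coordinates (y grows downwards), baselines given as functions of x, and a tolerance band of 50% of the baseline distance derived from the assumption stated above; the name `centre_line_offset` is hypothetical.

```python
def centre_line_offset(maxima, baseline_y, prev_baseline_y, max_fraction=0.5):
    """Average height of local maxima above a baseline segment.

    maxima:          list of (x, y) local maximum vertices
    baseline_y:      callable x -> y of the current BLS
    prev_baseline_y: callable x -> y of the previous (upper) baseline
    Returns the mean offset, or None if no maximum lies in the tolerance band.
    """
    offsets = []
    for x, y in maxima:
        base, prev = baseline_y(x), prev_baseline_y(x)
        offset = base - y  # positive when the maximum lies above the baseline
        # Keep maxima above the BLS and within the assumed 50% band.
        if 0 < offset < max_fraction * (base - prev):
            offsets.append(offset)
    return sum(offsets) / len(offsets) if offsets else None
```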




Fig. 3. Chain-code of writing, BLS (solid line), CLS (dashed line) with tolerance space 
(dashed rectangle) for the maxima (crosses). 



3 Results 

Line detection for church chronicles needs to be robust because even on a single 
page the style of the script may vary due to the time range of several days 
during which the page was written. This aim is achieved by the utilization of 
redundant information in the geometrical arrangement. For example a BLS can 
be reconstructed even if not all involved local minima are included or a baseline 
can be built even if one BLS is missing. The detection algorithm should perform 
well without requiring parameter adaptation unless the script differs widely. 
Parameters should only depend on features that can be easily deduced from the 
script. 

Parameters depending on Wc or hsc were adapted to three different scripts 
from 1649, 1727 and 1812, shown in Fig. 4. We found the optimal value by 
changing a parameter step by step and counting the resulting errors (see Fig. 5 
(a)). By calculating factors from these values and the basic features of the script, 
we can adapt the parameters of subsection 2.1 and subsection 2.3 to another script. 

The robustness of the algorithm can be seen from the wide range in which 
parameters may vary without a significant increase of error (shown for one 
parameter in Fig. 5 (a); results for the other parameters were similar). Other 
parameters, like the range of directions during detection of pBLSs, scarcely affect 
the results and are kept constant. 

Robustness was tested by deliberately varying Wc and hsc from their optimal 
setting and recording the error. Results can be seen in Fig. 6. 

Fig. 4. Samples of script from 1649, 1727 and 1812 and the basic parameters 
(values printed with the samples: hsc = 2.0 mm, Wc = 2.4 mm; hsc = 2.6 mm, 
Wc = 4.5 mm; Wc = 3.0 mm). 

(a) Dependence of error rate on the parameter. (b) Example of an error in 
baseline reconstruction. 

Fig. 5. Example of a test result using script from 1812 and an example of an error. 



The error rate was defined as the percentage of words with wrong or missing 
reconstructed base or centre lines (see example in Fig. 5 (b)). 

We tested entries in church registers with a total of 1161 words and five different 
handwritings (from 1649, 1727, 1812, 1817 and 1838). The error rates of all 
handwritings are between 3 and 5 percent; only the script from 1727 causes an 
error rate of 9 percent because there are many interferences such as underlines and 
single words between the lines of text. Nonetheless, the total result is 49 errors 
in 1161 words (4.2%). 



4 Conclusions 

We presented a robust approach for the detection of baselines and centre lines of 
text, which are the most important lines for automatic text reading. The 
procedure enables line recognition under the complex conditions caused by the 
style of script. It works by gradually building up evidence for text lines based 
on local minima. It requires only an approximate adaptation of parameters to 
basic features of the script. Thus, the algorithm finds text lines with high accuracy 
without additional adaptations, although the style of writing may change even 
on a single page. 

Fig. 6. Dependence of the error rate on the two script features hsc and Wc, tested on 
a paragraph from 1812. The location of the estimated values for hsc and Wc is shown 
as 'x'. 



Further improvement could come from a prior analysis of distance values 
of local extrema in order to get information about the script features. Serious 
interferences could be eliminated if underlines were recognized. 

However, a 100% error-free reconstruction without preceding recognition of 
text itself will be impossible. This is the case because local straightness, hori- 
zontal continuity of lines and vertical continuity of distances between lines must 
allow for some variation. Cases can be constructed where this variation will lead 
to wrong selection of line segments. 



Abstract. Novel similarity measures for object recognition and image matching 
are proposed, which are inherently robust against occlusion, clutter, and nonlinear 
illumination changes. They can be extended to be robust to global as well as 
local contrast reversals. The similarity measures are based on representing the 
model of the object to be found and the image in which the model should be 
found as a set of points and associated direction vectors. They are used in an 
object recognition system for industrial inspection that recognizes objects under 
Euclidean transformations in real time. 



1 Introduction 

Object recognition is used in many computer vision applications. It is particularly useful 
for industrial inspection tasks, where often an image of an object must be aligned with 
a model of the object. The transformation (pose) obtained by the object recognition 
process can be used for various tasks, e.g., pick and place operations or quality control. 
In most cases, the model of the object is generated from an image of the object. This 2D 
approach is taken because it usually is too costly or time consuming to create a more 
complicated model, e.g., a 3D CAD model. Therefore, in industrial inspection tasks one 
is usually interested in matching a 2D model of an object to the image. The object may be 
transformed by a certain class of transformations, depending on the particular setup, e.g., 
translations, Euclidean transformations, similarity transformations, or general 2D affine 
transformations (which are usually taken as an approximation to the true perspective 
transformations an object may undergo). 

Several methods have been proposed to recognize objects in images by matching 2D 
models to images; surveys of matching approaches are available in the literature. 
In most 2D matching approaches the model is systematically compared to the image 
using all allowable degrees of freedom of the chosen class of transformations. The 
comparison is based on a suitable similarity measure (also called match metric). The 
maxima or minima of the similarity measure are used to decide whether an object is 
present in the image and to determine its pose. To speed up the recognition process, 
the search is usually done in a coarse-to-fine manner, e.g., by using image pyramids. 

The simplest class of object recognition methods is based on the gray values of the 
model and image itself and uses normalized cross correlation or the sum of squared 
or absolute differences as a similarity measure. Normalized cross correlation is 
invariant to linear brightness changes but is very sensitive to clutter and occlusion as 
well as nonlinear contrast changes. The sum of gray value differences is not robust to 
any of these changes, but can be made robust to linear brightness changes by explicitly 
incorporating them into the similarity measure, and to a moderate amount of occlusion 
and clutter by computing the similarity measure in a statistically robust manner. 

A more complex class of object recognition methods does not use the gray values of 
the model or object itself, but uses the object's edges for matching. In all existing 
approaches, the edges are segmented, i.e., a binary image is computed for both the 
model and the search image. Usually, the edge pixels are defined as the pixels in the 
image where the magnitude of the gradient is maximum in the direction of the gradient. 
Various similarity measures can then be used to compare the model to the image. One 
such similarity measure computes the average distance of the model edges and the image 
edges. The disadvantage of this similarity measure is that it is not robust to occlusions 
because the distance to the nearest edge increases significantly if some of the edges of 
the model are missing in the image. 

A similarity measure based on the Hausdorff distance tries to remedy this 
shortcoming by calculating the maximum of the k-th largest distance of the model edges 
to the image edges and the l-th largest distance of the image edges to the model edges. 
If the model contains n points and the image contains m edge points, the similarity 
measure is robust to 100k/n% occlusion and 100l/m% clutter. Unfortunately, an estimate 
for m is needed to determine l, which is usually not available. 

All of these similarity measures have the disadvantage that they do not take the 
direction of the edges into account. It has been shown that disregarding the edge 
direction information leads to false positive instances of the model in the image. One 
proposed similarity measure tries to improve this by modifying the Hausdorff distance 
to also measure the angle difference between the model and image edges. Unfortunately, 
the implementation is based on multiple distance transformations, which makes the 
algorithm too computationally expensive for industrial inspection. 

Finally, another class of edge based object recognition algorithms is based on the 
generalized Hough transform (GHT). Approaches of this kind have the advantage that they 
are robust to occlusion as well as clutter. Unfortunately, the GHT requires extremely 
accurate estimates for the edge directions or a complex and expensive processing scheme, 
e.g., smoothing the accumulator space, to determine whether an object is present and 
to determine its pose. This problem is especially grave for large models. The required 
accuracy is usually not obtainable, even in low noise images, because the discretization 
of the image leads to edge direction errors that already are too large for the GHT. 

In all of the above approaches, the edge image is binarized. This makes the object 
recognition algorithm invariant only against a narrow range of illumination changes. 
If the image contrast is lowered, progressively fewer edge points will be segmented, 
which has the same effects as progressively larger occlusion. The similarity measures 
proposed in this paper overcome all of the above problems and result in an object 
recognition strategy robust against occlusion, clutter, nonlinear illumination changes, 
and a relatively large amount of defocusing. They can be extended to be robust to global 
as well as local contrast reversals. 



2 Similarity Measures 



The model of an object consists of a set of points $p_i = (x_i, y_i)^T$ with a 
corresponding direction vector $d_i = (t_i, u_i)^T$, $i = 1, \ldots, n$. The direction 
vectors can be generated by a number of different image processing operations, e.g., 
edge, line, or corner extraction, as discussed in Section 3. Typically, the model is 
generated from an image of the object, where an arbitrary region of interest (ROI) 
specifies that part of the image in which the object is located. It is advantageous to 
specify the coordinates $p_i$ relative to the center of gravity of the ROI of the model 
or to the center of gravity of the points of the model. 

The image in which the model should be found can be transformed into a 
representation in which a direction vector $e_{x,y} = (v_{x,y}, w_{x,y})^T$ is obtained 
for each image point $(x, y)$. In the matching process, a transformed model must be 
compared to the image at a particular location. In the most general case considered 
here, the transformation is an arbitrary affine transformation. It is useful to separate 
the translation part of the affine transformation from the linear part. Therefore, a 
linearly transformed model is given by the points $p'_i = A p_i$ and the accordingly 
transformed direction vectors $d'_i = A d_i$, where 

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}.$$

As discussed above, the similarity measure by which the transformed model is 
compared to the image must be robust to occlusions, clutter, and illumination changes. 
One such measure is to sum the (unnormalized) dot product of the direction vectors of 
the transformed model and the image over all points of the model to compute a matching 
score at a particular point $q = (x, y)^T$ of the image, i.e., the similarity measure of 
the transformed model at the point $q$, which corresponds to the translation part of the 
affine transformation, is computed as follows: 

$$s = \frac{1}{n} \sum_{i=1}^{n} \langle d'_i, e_{q+p'_i} \rangle = \frac{1}{n} \sum_{i=1}^{n} \left( t'_i \, v_{x+x'_i,\, y+y'_i} + u'_i \, w_{x+x'_i,\, y+y'_i} \right) \tag{1}$$



If the model is generated by edge or line filtering, and the image is preprocessed in the 
same manner, this similarity measure fulfills the requirements of robustness to occlusion 
and clutter. If parts of the object are missing in the image, there are no lines or edges 
at the corresponding positions of the model in the image, i.e., the direction vectors will 
have a small length and hence contribute little to the sum. Likewise, if there are clutter 
lines or edges in the image, there will either be no point in the model at the clutter 
position or it will have a small length, which means it will contribute little to the sum. 

The similarity measure (1) is not truly invariant against illumination changes, 
however, since usually the length of the direction vectors depends on the brightness of 
the image, e.g., if edge detection is used to extract the direction vectors. However, if 
a user specifies a threshold on the similarity measure to determine whether the model is 
present in the image, a similarity measure with a well defined range of values is 
desirable. The following similarity measure achieves this goal: 

$$s = \frac{1}{n} \sum_{i=1}^{n} \frac{\langle d'_i, e_{q+p'_i} \rangle}{\|d'_i\| \cdot \|e_{q+p'_i}\|} \tag{2}$$

Because of the normalization of the direction vectors, this similarity measure is 
additionally invariant to arbitrary illumination changes since all vectors are scaled to 
a length of 1. What makes this measure robust against occlusion and clutter is the fact 
that if a feature is missing, either in the model or in the image, noise will lead to 
random direction vectors, which, on average, will contribute nothing to the sum. 

The similarity measure (2) will return a high score if all the direction vectors of the 
model and the image align, i.e., point in the same direction. If edges are used to generate 
the model and image vectors, this means that the model and image must have the same 
contrast direction for each edge. Sometimes it is desirable to be able to detect the object 
even if its contrast is reversed. This is achieved by: 

$$s = \left| \frac{1}{n} \sum_{i=1}^{n} \frac{\langle d'_i, e_{q+p'_i} \rangle}{\|d'_i\| \cdot \|e_{q+p'_i}\|} \right| \tag{3}$$

In rare circumstances, it might be necessary to ignore even local contrast changes. 
In this case, the similarity measure can be modified as follows: 

$$s = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \langle d'_i, e_{q+p'_i} \rangle \right|}{\|d'_i\| \cdot \|e_{q+p'_i}\|} \tag{4}$$



The above three normalized similarity measures are robust to occlusion in the sense 
that the object will be found even if it is occluded. As mentioned above, this results from 
the fact that the missing object points in the instance of the model in the image will on 
average contribute nothing to the sum. For any particular instance of the model in the 
image, this may not be true, e.g., because the noise in the image is not uncorrelated. 
This leads to the undesired fact that the instance of the model will be found in different 
poses in different images, even if the model does not move in the images, because in 
a particular image of the model the random direction vectors will contribute slightly 
different amounts to the sum, and hence the maximum of the similarity measure will 
change randomly. To make the localization of the model more precise, it is useful to set 
the contribution of direction vectors caused by noise in the image to zero. The easiest 
way to do this is to set all inverse lengths $1/\|e_{q+p'_i}\|$ of the direction vectors in the image 
to 0 if their length $\|e_{q+p'_i}\|$ is smaller than a threshold that depends on the noise level in 
the image and the preprocessing operation that is used to extract the direction vectors in 
the image. This threshold can be specified easily by the user. By this modification of the 
similarity measure, it can be ensured that an occluded instance of the model will always 
be found in the same pose if it does not move in the images. 
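
The normalized measures and the noise threshold can be illustrated with a short NumPy sketch. It is only a sketch under assumptions: the model is taken to be already transformed with integer pixel offsets, all model vectors are nonzero, no bounds checking is done, and the names `similarity`, `mode`, and `min_len` are hypothetical.

```python
import numpy as np

def similarity(model_pts, model_dirs, image_dirs, q, mode="plain", min_len=0.0):
    """Normalized direction-vector similarity at translation q = (x, y).

    model_pts:  (n, 2) integer offsets (x'_i, y'_i) of the transformed model
    model_dirs: (n, 2) nonzero direction vectors (t'_i, u'_i)
    image_dirs: (H, W, 2) direction vectors (v, w) for every pixel
    mode:       "plain"  -> sum of normalized dot products
                "global" -> absolute value of the sum (global contrast reversal)
                "local"  -> sum of absolute values (local contrast reversal)
    min_len:    image vectors not longer than this are treated as noise
    """
    xs = q[0] + model_pts[:, 0]
    ys = q[1] + model_pts[:, 1]
    e = image_dirs[ys, xs]                      # (n, 2) image vectors under the model
    dots = np.sum(model_dirs * e, axis=1)
    d_len = np.linalg.norm(model_dirs, axis=1)
    e_len = np.linalg.norm(e, axis=1)
    # Inverse lengths of short image vectors are set to 0.
    valid = e_len > min_len
    inv_e = np.where(valid, 1.0 / np.where(valid, e_len, 1.0), 0.0)
    terms = dots / d_len * inv_e
    if mode == "local":
        terms = np.abs(terms)
    score = terms.mean()
    return abs(score) if mode == "global" else score
```

With `min_len` set above the noise level of the edge filter, vectors caused by noise no longer shift the maximum of the score, which is the effect described in the paragraph above.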

The normalized similarity measures (2)-(4) have the property that they return a 
number not larger than 1 as the score of a potential match. In all cases, a score of 1 
indicates a perfect match between the model and the image. Furthermore, the score 
roughly corresponds to the portion of the model that is visible in the image. For example, 
if the object is 50% occluded, the score (on average) cannot exceed 0.5. This is a highly 
desirable property because it gives the user the means to select an intuitive threshold for 
when an object should be considered as recognized. 

A desirable feature of the above similarity measures (2)-(4) is that they do not need 
to be evaluated completely when object recognition is based on a threshold $s_{\min}$ for 
the similarity measure that a potential match must achieve. Let $s_j$ denote the partial 
sum of the dot products up to the j-th element of the model. For the match metric that 
uses the sum of the normalized dot products, this is: 

$$s_j = \frac{1}{n} \sum_{i=1}^{j} \frac{\langle d'_i, e_{q+p'_i} \rangle}{\|d'_i\| \cdot \|e_{q+p'_i}\|}$$

Obviously, the remaining $n - j$ terms of the sum are all $\le 1$. Therefore, the partial 
score can never achieve the required score $s_{\min}$ if $s_j < s_{\min} - 1 + j/n$, and 
hence the evaluation of the sum can be discontinued after the j-th element whenever this 
condition is fulfilled. This criterion speeds up the recognition process considerably. 
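
The stopping criterion can be sketched in a few lines of Python. The function name `score_with_early_exit` and the return convention are assumptions; the terms are taken to be the already normalized dot products, each at most 1.

```python
def score_with_early_exit(terms, s_min):
    """Accumulate normalized dot products and stop once s_min is out of reach.

    terms: sequence of n normalized dot products, each <= 1
    Returns (score, evaluated): score is None if the evaluation was aborted,
    evaluated is the number of terms that were summed.
    """
    terms = list(terms)
    n = len(terms)
    partial = 0.0
    for j, term in enumerate(terms, start=1):
        partial += term / n
        # The remaining n - j terms can add at most (n - j) / n = 1 - j / n.
        if partial < s_min - 1.0 + j / n:
            return None, j
    return partial, n
```

For a model with many points and a strict threshold, most poses are rejected after only a small fraction of the terms.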

3 Object Recognition 

The above similarity measures are applied in an object recognition system for industrial 
inspection that recognizes objects under Euclidean transformations, i.e., translation and 
rotation, in real time. Although only Euclidean transformations are implemented at the 
moment, extensions to similarity or general affine transformations are not difficult to 
implement. The system consists of two modules: an offline generation of the model and 
an online recognition. 

The model is generated from an image of the object to be recognized. An arbitrary 
region of interest specifies the object’s location in the image. Usually, the ROI is specified 
by the user. Alternatively, it can be generated by suitable segmentation techniques. To 
speed up the recognition process, the model is generated in multiple resolution levels, 
which are constructed by building an image pyramid from the original image. The number 
of pyramid levels $l_{\max}$ is chosen by the user. 

Each resolution level consists of all possible rotations of the model, where thresholds 
$\varphi_{\min}$ and $\varphi_{\max}$ for the angle are selected by the user. The step 
length for the discretization of the possible angles can either be determined 
automatically by a method similar to the one described in [12] or be set by the user. In 
higher pyramid levels, the step length for the angle is computed by doubling the step 
length of the next lower pyramid level. 
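
The doubling rule for the angle step can be written down as a small sketch. The function name `rotation_angles`, the use of degrees, and numbering the finest pyramid level as 0 are assumptions for illustration.

```python
def rotation_angles(phi_min, phi_max, base_step, num_levels):
    """Discretized rotation angles per pyramid level (level 0 = finest).

    The step length doubles from each pyramid level to the next higher one.
    """
    levels = []
    for level in range(num_levels):
        step = base_step * 2 ** level
        count = int((phi_max - phi_min) / step) + 1
        levels.append([phi_min + k * step for k in range(count)])
    return levels
```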

The rotated models are generated by rotating the original image of the current pyra- 
mid level and performing the feature extraction in the rotated image. This is done because 
the feature extractors may be anisotropic, i.e., the extracted direction vectors may depend 
on the orientation of the feature in the image in a biased manner. If it is known that the 
feature extractor is isotropic, the rotated models may be generated by performing the 
feature extraction only once per pyramid level and transforming the resulting points and 
direction vectors. 

The feature extraction can be done by a number of different image processing 
algorithms that return a direction vector for each image point. One such class of 
algorithms are edge detectors, e.g., the Sobel or Canny operators. Another useful class 
of algorithms are line detectors. Finally, corner detectors that return a direction 
vector could also be used. Because of runtime considerations the Sobel filter is used in 
the current implementation of the object recognition system. Since in industrial 
inspection the lighting can be controlled, noise does not pose a significant problem in 
these applications. 

To recognize the model, an image pyramid is constructed for the image in which 
the model should be found. For each level of the pyramid, the same filtering operation 
that was used to generate the model, e.g., Sobel filtering, is applied to the image. This 
returns a direction vector for each image point. Note that the image is not segmented, 
i.e., thresholding or other operations are not performed. This results in true robustness 
to illumination changes. 

To identify potential matches, an exhaustive search is performed for the top level 
of the pyramid, i.e., all precomputed models of the top level of the model resolution 
hierarchy are used to compute the similarity measure via (2), (3), or (4) for all possible 
poses of the model. A potential match must have a score larger than a user-specified 
threshold $s_{\min}$ and the corresponding score must be a local maximum with respect to 
neighboring scores. As described in Section 2, the threshold $s_{\min}$ is used to speed up 
the search by terminating the evaluation of the similarity measure as early as possible. 

After the potential matches have been identified, they are tracked through the 
resolution hierarchy until they are found at the lowest level of the image pyramid. Various 
search strategies like depth-first, best-first, etc., have been examined. It turned out that a 
breadth-first strategy is preferable for various reasons, most notably because a heuristic 
for a best-first strategy is hard to define, and because depth-first search results in slower 
execution if all matches should be found. 

Once the object has been recognized on the lowest level of the image pyramid, its 
position and rotation are extracted to a resolution better than the discretization of the 
search space, i.e., the translation is extracted with subpixel precision and the angles 
with a resolution better than the angle step length. This is done by fitting a second 
order polynomial (in the three pose variables) to the similarity measure values in a 
3×3×3 neighborhood around the maximum score. The coefficients of the polynomial 
are obtained by convolution with 3D facet model masks. The corresponding 2D masks 
are available in the literature; they generalize to arbitrary dimensions in a 
straightforward manner. 
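
The refinement step can be illustrated with a NumPy sketch. Note the substitution: instead of convolving with facet model masks, the sketch obtains the polynomial coefficients with an ordinary least-squares fit over the 27 samples. The function name `refine_pose` and the axis order (x, y, angle) are assumptions.

```python
import numpy as np

def refine_pose(scores):
    """Sub-grid extremum of a second order polynomial fitted to 3x3x3 scores.

    scores: array of shape (3, 3, 3), indexed [x, y, angle], centre = discrete maximum
    Returns the offset (dx, dy, dphi) of the extremum in units of grid steps.
    """
    # Sample coordinates in the same order as scores.ravel() (x slowest, angle fastest).
    coords = np.array([(x, y, p) for x in (-1, 0, 1)
                                 for y in (-1, 0, 1)
                                 for p in (-1, 0, 1)], dtype=float)
    x, y, p = coords.T
    design = np.column_stack([np.ones(27), x, y, p,
                              x * x, y * y, p * p, x * y, x * p, y * p])
    values = np.asarray(scores, dtype=float).ravel()
    c = np.linalg.lstsq(design, values, rcond=None)[0]
    gradient = c[1:4]
    hessian = np.array([[2 * c[4], c[7],     c[8]],
                        [c[7],     2 * c[5], c[9]],
                        [c[8],     c[9],     2 * c[6]]])
    # The extremum satisfies hessian @ offset + gradient = 0.
    return np.linalg.solve(hessian, -gradient)
```

The returned offset is added to the discrete pose; multiplying the last component by the angle step length gives the refined rotation.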



4 Examples 

Figure 1 displays an example of recognizing multiple objects. To illustrate the robustness 
against nonlinear illumination changes, the model image in Figure 1(a) was acquired 
using back lighting. Figure 1(b) shows that all three cog wheels have been recognized 
correctly despite the fact that front lighting is used and that a fourth cog wheel occludes 
two of the other cog wheels. 



5 Conclusions 

A new class of similarity measures for object recognition and image matching, which are 
inherently robust against occlusion, clutter, nonlinear illumination changes, and global 
as well as local contrast reversals, has been proposed. The similarity measures are used 
in an object recognition system for industrial inspection that is able to recognize objects 
under Euclidean transformations at video frame rate. The system is able to achieve an 
accuracy of 1/22 pixel and 1/12 degree on real images. 

(a) Image of model object (b) Found objects 



Fig. 1. Example of recognizing multiple objects. To illustrate the robustness against illumination 
changes, the model image uses back lighting while the search image uses front lighting. 



Future work will focus on extending the object recognition system to handle at least 
similarity transformations and possibly general affine transformations. 



Abstract. This paper presents a general solution for the problem of 
affine point pattern matching (APPM). Formally, we are given two sets of 
two-dimensional points (x, y) which are related by a general affine 
transformation (up to small deviations of the point coordinates and possibly 
some additional outliers). We can then determine the six parameters $a_{ik}$ of 
the transformation using new Hu point-invariants which are invariant with 
respect to affine transformations. With these invariants we compute a 
weighted point reference list. The affine parameters can be calculated 
using the method of the least absolute differences (LAD method) and 
using linear programming. In comparison to the least squares method, 
our approach is very robust against noise and outliers. The algorithm 
works in O(n) average time and can be used for translations and/or 
rotations, isotropic and non-isotropic scalings, shear transformations and 
reflections. 



1 Introduction 



Point pattern matching (PPM) is an important problem in image processing. 
Cox and Jager gave an overview of the different types of transformations and 
methods as early as 1993 [?]. 

The method presented here is a general solution for the problem of 
affine point pattern matching in $O(n)$ time, where the point assignment is 
performed in the space of the so-called Hu invariants, see [?]. In the original 
literature the Hu invariants are only invariant with respect to similarity 
transformations. In the present paper we develop Hu invariants which are even affinely 
invariant, using the ideas of normalization. With the help of these invariants, a 
list of weighted point references can be calculated. Following an idea of Ben-Ezra 
[?], the affine transformation can be calculated by the least absolute differences 
(the so-called LAD method) using linear programming. The algorithm is numerically 
stable against noise in the point coordinates and against outliers. 
2 Affine Invariants 



2.1 Affine Normalization of Discrete Point Sets 



We are given an affine transformation and two discrete sets of points. We use 
the method of normalization well known in the theory of affine invariants for 
planar objects, see [?]. As features we use the discrete moments instead of 
the area moments. Therefore, we first define the discrete moments 
$M_{j,k} = \sum_{(x,y) \in P} x^j y^k$ of a finite point set $P$. Here $M_{0,0} = n$ is the number of points, and 
$(\bar{x}, \bar{y})$ is the center of gravity of the given point set. The central moments $m'_{j,k}$ are 
given by the translation $x' = x - \bar{x}$ and $y' = y - \bar{y}$, so that we have $m'_{1,0} = 0$ 
and $m'_{0,1} = 0$ in the first normalization step. In the second step we perform with 
$x'' = x' + \alpha y'$ and $y'' = y'$ a shearing (or "stretching") in the x direction. By 
this, the old central moments $m'_{j,k}$ are transformed into new moments $m''_{j,k}$. In 
particular, the simple requirement $m''_{1,1} = 0$ determines the parameter 
$\alpha$ as $\alpha = -m'_{1,1} / m'_{0,2}$. Now we can calculate all new moments 
$m''_{j,k}$ with the given normalization values $(m''_{1,0}, m''_{0,1}, m''_{1,1}) = (0, 0, 0)$. 

Finally, a general anisotropic scaling $x''' = \beta x''$ and $y''' = \gamma y''$ yields 
the moments $m'''_{j,k} = \beta^j \gamma^k m''_{j,k}$. The moments $m'''_{2,0}$ and $m'''_{0,2}$ shall be 
normalized to 1, so that the parameters $\beta$ and $\gamma$ have to be 

$$\beta = \frac{1}{\sqrt{m''_{2,0}}} , \qquad \gamma = \frac{1}{\sqrt{m''_{0,2}}} .$$

Note that there are differences between these expressions and the expressions using the 
normalization of the area moments, see [?]. 

Now, the five lowest order moments are normalized to the canonical values: 

$$(m'''_{1,0}, m'''_{0,1}, m'''_{1,1}, m'''_{2,0}, m'''_{0,2}) = (0, 0, 0, 1, 1) .$$
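
The three normalization steps can be illustrated with a short sketch. It is only a minimal illustration under the definitions above, not the authors' implementation; the function names `moment` and `normalize_points` are chosen for this example.

```python
import math

def moment(points, j, k):
    """Discrete moment: sum of x^j * y^k over all points."""
    return sum(x ** j * y ** k for x, y in points)

def normalize_points(points):
    """Map a point set to the canonical frame (translation, shear, scaling)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Step 1: translation to the centroid, so that m10 = m01 = 0.
    pts = [(x - cx, y - cy) for x, y in points]
    # Step 2: shearing in x direction, so that m11 = 0.
    alpha = -moment(pts, 1, 1) / moment(pts, 0, 2)
    pts = [(x + alpha * y, y) for x, y in pts]
    # Step 3: anisotropic scaling, so that m20 = m02 = 1.
    beta = 1.0 / math.sqrt(moment(pts, 2, 0))
    gamma = 1.0 / math.sqrt(moment(pts, 0, 2))
    return [(beta * x, gamma * y) for x, y in pts]

# The ten points listed in Table 2 serve as sample data.
points = [(164, 320), (422, 314), (150, 425), (89, 75), (242, 398),
          (324, 434), (91, 189), (339, 417), (263, 355), (448, 112)]
canonical = normalize_points(points)
```

After the call, the five lowest order moments of `canonical` take the values (0, 0, 0, 1, 1) up to rounding errors. The sketch assumes a non-degenerate point set, i.e. the points do not all lie on one line.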



2.2 Hu Invariants with Respect to Affine Transformations 

A stable normalization of the rotation is not possible. For that reason, we 
try to find numerically stable expressions which are invariant against rotations. 
Such numerically stable expressions were introduced by Hu (see [?]). Hu 
derived 7 invariants $H_1, H_2, \ldots, H_7$ involving moments of first to third order. For our 
purpose we need additional Hu invariants involving moments of up to fourth order. 
These invariants can be easily derived using the so-called complex moments, 
see [?]. Some of the classical Hu invariants and the new Hu invariants are 
given in Table 1. All other Hu invariants up to order 4 vanish because of our 
normalizations $m'''_{2,0} = m'''_{0,2} = 1$ and $m'''_{1,1} = 0$. Note that the so-called algebraic 
moment invariants (see [?]) are numerically too sensitive to noise because 
they contain powers of the moments up to 11. 



2.3 Hu Invariants of Single Points 

We put the origin of the coordinate system in each of the points $p \in P$ and 
compute the affine Hu invariants $H_k(p)$, which now depend on this point $p$. That 
means we need not normalize the moments to the centroid because $p$ itself is 
the origin. It follows that we obtain affine Hu invariants for every point $p$. Now 
we can search, for any point $p \in P$, the corresponding point $q \in Q$ with the same 
(or at least similar) Hu invariants. 
Table 1. Affine Hu invariants up to 4th order moments 

order   Hu invariant 
2       $H_2$      $m_{20} + m_{02} = 2$ 
3       $H_3$      $(m_{30} - 3m_{12})^2 + (3m_{21} - m_{03})^2$ 
3       $H_4$      (expression not legible in the source) 
4       $H_8$      (expression not legible in the source) 
4       $H_9$      $(m_{40} + m_{04})^2 + 16 (m_{31} - m_{13})^2 - 12 m_{22} (m_{40} - 3m_{22} + m_{04})$ 
4       $H_{11}$   $m_{40} + 2m_{22} + m_{04}$ 



Table 2. Affine discrete Hu invariants $H_k(p)$ for every point from a point set of 10 
points 

point p      H_3(p)   H_4(p)   H_8(p)   H_9(p)   H_11(p) 
(164,320)    0.014    0.645    0.084    0.134    0.755 
(422,314)    0.127    0.531    0.042    0.104    0.609 
(150,425)    0.232    0.687    0.020    0.182    0.732 
(89,75)      0.345    0.544    0.081    0.123    0.670 
(242,398)    0.324    0.855    0.033    0.505    0.897 
(324,434)    0.288    0.776    0.049    0.399    0.851 
(91,189)     0.334    0.609    0.117    0.200    0.721 
(339,417)    0.184    0.739    0.045    0.297    0.834 
(263,355)    0.242    0.693    0.031    0.481    0.877 
(448,112)    0.323    0.483    0.037    0.107    0.561 



3 Determination of the Affine Transformations 

3.1 Nearest Neighbor Search in the Hu Space 

First of all we have to calculate, in the space of all Hu point invariants, a list of 
reference points $\{p_i = (x_i, y_i),\ q_i = (x'_i, y'_i),\ w_i\}$. For a point $p_i \in P$ 
we have to find a reference point $q_i \in Q$ with a weight or probability $w_i$. In our 
experiments we have used a simple nearest neighbor classifier to find the 
corresponding point. The weight $w_i$ can be calculated from the classifier, for example 
depending on whether the correspondence is unique and on the distance between 
the corresponding points. There are some heuristics for finding a good measure 
for the weights. 

Table 2 shows the numerical range of the affine Hu invariants of 
a point set. 
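
The nearest neighbor search can be sketched as follows. This is a minimal illustration and not the authors' implementation: the brute-force search shown here is quadratic in the number of points, and the weight (one minus the ratio of the best to the second-best distance) is a hypothetical heuristic chosen only for this example.

```python
import math

def match_points(inv_p, inv_q):
    """For each invariant vector in inv_p find the nearest vector in inv_q.

    Returns a list of (index_in_p, index_in_q, weight) triples.
    """
    matches = []
    for i, hp in enumerate(inv_p):
        dists = sorted((math.dist(hp, hq), j) for j, hq in enumerate(inv_q))
        best, j = dists[0]
        second = dists[1][0] if len(dists) > 1 else math.inf
        # Hypothetical weight: near 1 for a clear winner, near 0 if ambiguous.
        weight = 1.0 - best / second if second > 0 else 0.0
        matches.append((i, j, weight))
    return matches

# Three rows of Table 2 as the invariants of Q ...
inv_q = [(0.014, 0.645, 0.084, 0.134, 0.755),
         (0.127, 0.531, 0.042, 0.104, 0.609),
         (0.232, 0.687, 0.020, 0.182, 0.732)]
# ... and slightly disturbed, reordered copies as the invariants of P.
inv_p = [tuple(v + 0.001 for v in inv_q[k]) for k in (2, 0, 1)]
matches = match_points(inv_p, inv_q)
```

A k-d tree or a hash grid in the Hu space would be needed to approach the linear running time stated in the paper; the list-based search above only shows the principle.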

3.2 Estimators for the Affine Parameters 

The next step is the calculation of the affine transformation from the 
weighted correspondence list. This can be done in the common way by the least 
squares estimator using the $L_2$-metric. An advantage of this estimator is that 
we only have to solve a linear system of 6 equations with 6 variables. But it is 
also well known that the least squares method is very sensitive to outliers. That 
is the reason why we need a good initial guess for an iteratively reweighted least 
squares algorithm. It is also well known that the $L_1$-metric is much more robust 
than the $L_2$-metric. For example, if we have the two one-dimensional problems 
$L_2 = \sum_i (x_i - \mu)^2 \to \min!$ and $L_1 = \sum_i |x_i - \nu| \to \min!$, then the median 
$\nu$ minimizes the $L_1$-metric, and the centroid $\mu$ minimizes the $L_2$-metric. 
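
The one-dimensional example is easy to check numerically. The following sketch uses made-up sample values, with the last one acting as an outlier:

```python
from statistics import mean, median

def l1_cost(values, v):
    """Sum of absolute differences to v."""
    return sum(abs(x - v) for x in values)

def l2_cost(values, v):
    """Sum of squared differences to v."""
    return sum((x - v) ** 2 for x in values)

samples = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last value acts as an outlier
mu = mean(samples)      # minimizer of the L2 cost
nu = median(samples)    # minimizer of the L1 cost
```

The single outlier pulls the centroid far away from the bulk of the data, while the median stays at the middle value of the remaining points.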

For that reason we use the method of the least absolute differences (LAD, $L_1$-metric) 
instead of the least squares method ($L_2$-metric): 

$$L_1 = \sum_{i=1}^{n} \left( w_i\,|x'_i - a_{11} x_i - a_{12} y_i - a_{13}| + w_i\,|y'_i - a_{21} x_i - a_{22} y_i - a_{23}| \right) \to \text{minimum!}$$

The next question is how to find a good numerical procedure for solving this problem. 
It is not possible to find the solution directly from $\nabla L_1 = 0$, but if we use the relation 
$|x| = \sqrt{x^2}$ it is possible to form the gradient at any point excluding $x = 0$. 

Starting with an arbitrary initial value for the 6 affine parameters, a sequence 
of parameters can be calculated using the gradient method. It is simple to 
show that this sequence converges towards the global minimum. We have only a 
6-dimensional optimization problem, but this problem is nonlinear and difficult 
to handle. Another very good idea is to form a linear programming problem, 
with the disadvantage that the number of variables increases. But the handling 
of a linear programming problem is very simple. 



3.3 Linear Programming 



To form a linear programming problem we follow an idea from [?] and solve 
this optimization problem with a linear optimization technique, the well-known 
simplex method, see [?]. But first of all we have to translate the $L_1$ optimization 
problem into a linear optimization problem. Let $d_i = f(x_i, y_i)$ be the residual 
of a function $f(x, y)$ for the point $(x_i, y_i)$. We introduce two new nonnegative 
variables with $d_i = d_i^+ - d_i^-$, $d_i^+ \ge 0$, $d_i^- \ge 0$. Now we consider the absolute 
value 

$$|d_i| = |d_i^+ - d_i^-| = \begin{cases} d_i^+ , & \text{if } d_i^- = 0 \\ d_i^- , & \text{if } d_i^+ = 0 \end{cases}$$

and construct the following optimization problem 

$$\mathit{min}_{value} = \min \Big\{ \sum_i (d_i^+ + d_i^-) \Big\} .$$

It is easy to see that $\mathit{min}_{value} = \min \{ \sum_i |d_i| \}$ because it is necessary that 
at the minimum at least one of the variables $d_i^+$ or $d_i^-$ vanishes. In our problem 
we have two residuals or defects for every point, $d1_i = x'_i - a_{11} x_i - a_{12} y_i - a_{13}$ 
and $d2_i = y'_i - a_{21} x_i - a_{22} y_i - a_{23}$. This provides us with the basic idea for a linear 
programming solver. 
The standard form of a linear optimization problem (LOP) is: 

$$c^T x \to \text{minimum} \quad \text{subject to} \quad D x = b, \quad x_k \ge 0 \ \ \forall k .$$

$D$ is an $(m \times n)$ matrix, $m$ is the number of constraint equations and $n$ is the 
number of variables with $n > m$. 

Because of that, we have to form all affine variables as differences of 
nonnegative variables, $a_{ij} = a_{ij}^+ - a_{ij}^-$, $a_{ij}^+ \ge 0$, $a_{ij}^- \ge 0$. Following the remarks at 
the beginning of this section, this also has to be done for the residuals $d_i$ using slack 
variables $z_k = z_k^+ - z_k^-$, $z_k^+ \ge 0$, $z_k^- \ge 0$. The constraint equations are now 

$$a_{11}^+ x_i - a_{11}^- x_i + a_{12}^+ y_i - a_{12}^- y_i + a_{13}^+ - a_{13}^- + z1_i^+ - z1_i^- = x'_i$$

$$a_{21}^+ x_i - a_{21}^- x_i + a_{22}^+ y_i - a_{22}^- y_i + a_{23}^+ - a_{23}^- + z2_i^+ - z2_i^- = y'_i$$

$$i = 1, \ldots, N .$$

A problem in solving this LO-problem is that the set of all feasible solutions is 
unbounded due to the differences of nonnegative variables $a_{ij} = a_{ij}^+ - a_{ij}^-$, 
$a_{ij}^+ \ge 0$, $a_{ij}^- \ge 0$ and $z_k = z_k^+ - z_k^-$, $z_k^+ \ge 0$, $z_k^- \ge 0$. This causes difficulties when using 
the simplex method. To overcome these difficulties we introduce an additional 
bounding constraint: 

$$\sum_{k,l} a_{kl}^+ + \sum_{k,l} a_{kl}^- + \sum_{i=1}^{N} z1_i^+ + \sum_{i=1}^{N} z1_i^- + \sum_{i=1}^{N} z2_i^+ + \sum_{i=1}^{N} z2_i^- + s = \mathit{limit} .$$

Here, $s \ge 0$ is an additional slack variable and $\mathit{limit}$ is a given, very large and 
positive constant. The parameters of the affine transformation can be determined 
by solving the following minimization problem: 

$$\sum_{i=1}^{N} w_i \left( z1_i^+ + z1_i^- + z2_i^+ + z2_i^- \right) \to \text{minimum!}$$

If we have a point list with $N$ corresponding points then our LO-problem contains 
$2N + 1$ constraint equations and $12 + 4N + 1$ variables including all slack 
variables. The average complexity of the simplex method is polynomial in the 
number of constraints $2N + 1$, which is in a fixed relation to the number $N$ of 
correspondences. In most practical cases, however, the average complexity is known 
to be nearly linear, see [?]. We can decrease the complexity by decomposing the 
system into two systems, estimating the parameters $a_{11}, a_{12}, a_{13}$ and $a_{21}, a_{22}, a_{23}$ 
separately. 
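
To make the size and layout of this LO-problem concrete, the following sketch assembles the cost vector, the constraint matrix and the right-hand side in standard form. It only builds the problem and does not solve it; any simplex implementation could take the result as input. The function name `build_lad_lp`, the ordering of the variables and the default value of `limit` are assumptions made for this illustration.

```python
def build_lad_lp(src, dst, weights, limit=1e6):
    """Assemble c, D, b of the LP  min c.x  subject to  D x = b, x >= 0.

    Assumed variable order:
      [a11+, a11-, a12+, a12-, a13+, a13-, a21+, ..., a23-,   12 affine parts
       z1_1+, z1_1-, z2_1+, z2_1-, ..., z2_N+, z2_N-,         4N residual parts
       s]                                                     bounding slack
    """
    n_pts = len(src)
    n_var = 12 + 4 * n_pts + 1
    c = [0.0] * n_var
    D, b = [], []
    for i, ((x, y), (xd, yd), w) in enumerate(zip(src, dst, weights)):
        z = 12 + 4 * i
        c[z:z + 4] = [w] * 4                 # weighted sum of residual parts
        for row_idx, target in ((0, xd), (1, yd)):
            row = [0.0] * n_var
            a = 6 * row_idx                  # first or second row of the map
            row[a:a + 6] = [x, -x, y, -y, 1.0, -1.0]
            row[z + 2 * row_idx] = 1.0       # positive part of the residual
            row[z + 2 * row_idx + 1] = -1.0  # negative part of the residual
            D.append(row)
            b.append(target)
    D.append([1.0] * n_var)                  # bounding constraint incl. s
    b.append(limit)
    return c, D, b
```

For $N$ correspondences the returned matrix has $2N + 1$ rows and $12 + 4N + 1$ columns, in agreement with the counts given above.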



4 Experimental Results for Synthetic Data Sets 

Three types of evaluation procedures were performed: (i) testing the $L_1$-metric 
and the $L_2$-metric using known point correspondences, (ii) testing the new affine 






K. Voss and H. Suesse 



Hu invariants to detect correct point correspondences, (iii) comparison of the $L_1$-metric 
and the $L_2$-metric using the affine Hu invariants. The tests were applied 
to different types of outliers and different standard deviations of noise. The 
coordinates of all points were in the interval $0 < x_i, y_i < 400\ \forall i$. 



4.1 Noise of the Coordinates 

The result of the noise scenario test can be seen in Fig. 1, which displays the relative 
errors of the affine parameters. The method is clearly robust against 
noise in the point coordinates. 




Fig. 1. Errors of the affine parameters in dependence of all noisy pixels 



4.2 Outliers 

Fig. 2 shows the dependence of the errors on the percentage of outliers. 
The outliers have a standard deviation of 100. The extreme robustness obtained 
with the $L_1$-metric and the affine Hu invariants, visible in Fig. 2, is very surprising. 
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per cent outliers 



Fig. 2. Errors of the affine parameters in dependence of outliers using the whole system 



4.3 Real-Time Performance 

Fig. 3 shows the complexity of the method in practical applications. 
The computation times are given in seconds and were measured on a 500 MHz PC. It can be seen 
that the complexity is nearly linear. 
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Fig. 3. Real-time performance of the system 
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Abstract. We compute the range flow field, i.e. the 3D velocity field, 
of a moving deformable surface from a sequence of range data. This is 
done in a differential framework for which we derive a new constraint 
equation that can be evaluated directly on the sensor data grid. It is 
shown how 3D structure and intensity information can be used together 
in the estimation process. We then introduce a method to compute 
surface expansion rates from the now available velocity field. The 
accuracy of the proposed scheme is assessed on a synthetic data set. 
Finally we apply the algorithm to study 3D leaf motion and growth on 
a real range sequence. 

Keywords. Range flow, expansion rates, range data sequences. 



1 Introduction 

We denote the instantaneous velocity field that describes the motion of a de- 
formable surface as range flow. The term range flow is used as we derive this 
velocity field from sequences of range data sets. Together with the 3D structure 
the range flow field can be used to study the dynamic changes of such surfaces. 
One interesting question is whether the surface area changes during the motion. 
This can for example be used to study growth processes in biological systems 
such as leaves or skin. 

The same displacement vector field has also been called scene flow when 
computed directly from stereo image sequences [?]. We present range flow 
estimation in a differential framework that is related to optical flow algorithms. 
Other approaches that use deformable models have been reported before [?]. 

The contribution of this paper is twofold. First we introduce a new version of 
the constraint equation for range flow estimation that can be evaluated directly 
on the sensor grid. Second we show how a change in the surface area can be 
determined locally from range flow fields. 

2 Range Flow 

We now restate the concept of range flow estimation and introduce a new for- 
mulation of the range flow motion constraint equation. 
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2.1 Motion Constraint 

The observed surface can be described by its depth as a function of space and 
time, $Z = Z(X, Y, t)$. The total derivative with respect to time then directly 
yields the range flow motion constraint equation [?]: 

$$Z_X U + Z_Y V - W + Z_t = 0 . \qquad (1)$$



Here $(U, V, W)^T$ is the range flow and indices denote partial derivatives. In order 
to evaluate this equation we need to compute the partial derivatives of the 
depth function with respect to world coordinates. However, typical range sensors 
sample the data on a sensor grid, e.g. that of a CCD camera. 

While it is possible to compute derivatives from a surface fit, we rather use 
convolutions. This allows us to draw on the well-established linear filter theory and 
can be implemented very efficiently. To this end we note that a range 
sensor produces one data set for each of $X$, $Y$ and $Z$ on its grid ($X = X(x, y, t)$ 
etc.). Here sensor coordinates are denoted by $(x, y)$. The three components of the 
range flow field are the total derivatives of the world coordinates with respect 
to time ($U = \frac{dX}{dt}$ etc.). This can be expressed in the following equations: 

$$U = \partial_x X\, \dot{x} + \partial_y X\, \dot{y} + \partial_t X , \qquad (2)$$

$$V = \partial_x Y\, \dot{x} + \partial_y Y\, \dot{y} + \partial_t Y , \qquad (3)$$

$$W = \partial_x Z\, \dot{x} + \partial_y Z\, \dot{y} + \partial_t Z . \qquad (4)$$

The total time derivative is indicated by a dot. As we are not interested in the 
rates of change in the sensor coordinate frame, we eliminate $\dot{x}$ and $\dot{y}$ to obtain 
the range flow motion constraint expressed in sensor coordinates: 



$$\frac{\partial(Z,Y)}{\partial(x,y)}\, U + \frac{\partial(X,Z)}{\partial(x,y)}\, V + \frac{\partial(Y,X)}{\partial(x,y)}\, W + \frac{\partial(X,Y,Z)}{\partial(x,y,t)} = 0 , \qquad (5)$$

where $\frac{\partial(Z,Y)}{\partial(x,y)}$ is the Jacobian of $Z, Y$ with respect to $x, y$. Notice that the Jacobians 
are readily computed from the derivatives of $X, Y, Z$ in the sensor frame obtained 
by convolving the data sets with derivative kernels. 

In practice many sensors have aligned world and sensor coordinate systems, 
which implies $\partial_y X = \partial_x Y = 0$. Yet Eq. (5) poses the general constraint 
independent of a particular sensor. 
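
As an illustration, the four Jacobians of Eq. (5) can be evaluated at one grid position from the nine sensor-frame derivatives. The sketch below is not taken from the paper; the function names and the way the derivatives are passed in are assumptions made for this example, and the derivative values would in practice come from the convolutions mentioned above.

```python
def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

def constraint_vector(dX, dY, dZ):
    """Coefficients (d1, d2, d3, d4) with d1*U + d2*V + d3*W + d4 = 0.

    Each argument is a triple (d/dx, d/dy, d/dt) of sensor-frame derivatives.
    """
    Xx, Xy, Xt = dX
    Yx, Yy, Yt = dY
    Zx, Zy, Zt = dZ
    d1 = det2(Zx, Zy, Yx, Yy)            # Jacobian of (Z, Y) w.r.t. (x, y)
    d2 = det2(Xx, Xy, Zx, Zy)            # Jacobian of (X, Z) w.r.t. (x, y)
    d3 = det2(Yx, Yy, Xx, Xy)            # Jacobian of (Y, X) w.r.t. (x, y)
    d4 = (Xx * (Yy * Zt - Yt * Zy)       # Jacobian of (X, Y, Z) w.r.t. (x, y, t)
          - Xy * (Yx * Zt - Yt * Zx)
          + Xt * (Yx * Zy - Yy * Zx))
    return d1, d2, d3, d4
```

For an aligned sensor, dividing the result by `-d3` gives the coefficients $(Z_X, Z_Y, -1, Z_t)$ of Eq. (1), which is a convenient sanity check.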



2.2 TLS Solution 

Equation (5) poses only one constraint in the three unknowns $U, V, W$. This 
manifestation of the aperture problem has been examined in more detail before 
[13]. In order to get an estimate we pool the constraints in a local neighbourhood 
and assume the flow to be constant within this area. 

As all data terms in Eq. (5) are bound to contain errors, it is reasonable to 
use total least squares estimation as opposed to standard least squares. To do so 
we rewrite Eq. (5) as $d^T p = 0$ with a data vector $d$ given by the Jacobians. In 
order to avoid the trivial solution we require that $|p| = 1$. It is straightforward to 
show that the solution is given by the eigenvector corresponding to the smallest 
eigenvalue of the so-called structure tensor 

$$J_{ij} = B * (d_i \cdot d_j) , \quad i, j = 1 \ldots 4 . \qquad (6)$$

The local integration is computed here via convolution with an averaging mask 
$B$, typically a Binomial. From the thus estimated parameter vector we can 
recover the range flow as: 

$$f = (U\ V\ W)^T = \frac{1}{p_4} (p_1\ p_2\ p_3)^T . \qquad (7)$$
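
A minimal sketch of this total least squares step for a single neighbourhood is given below. It is an illustration under simplifying assumptions, not the authors' code: the averaging mask is replaced by a plain sum over the neighbourhood, and the eigenvector of the smallest eigenvalue is found by power iteration on the shifted matrix `trace(J) * I - J` rather than by a full eigenvalue decomposition. All function names are chosen for this example.

```python
import math

def structure_tensor(ds):
    """Sum of the outer products d d^T over the neighbourhood (uniform weights)."""
    n = len(ds[0])
    return [[sum(d[i] * d[j] for d in ds) for j in range(n)] for i in range(n)]

def smallest_eigenvector(J, iterations=5000):
    """Eigenvector of the symmetric positive semidefinite matrix J that belongs
    to its smallest eigenvalue, via power iteration on trace(J) * I - J."""
    n = len(J)
    shift = sum(J[i][i] for i in range(n))
    p = [1.0] * n
    for _ in range(iterations):
        q = [shift * p[i] - sum(J[i][j] * p[j] for j in range(n))
             for i in range(n)]
        length = math.sqrt(sum(v * v for v in q))
        p = [v / length for v in q]
    return p

def range_flow(ds):
    """Recover (U, V, W) from pooled constraint vectors d = (d1, d2, d3, d4)."""
    p = smallest_eigenvector(structure_tensor(ds))
    return [p[0] / p[3], p[1] / p[3], p[2] / p[3]]
```

The sketch assumes that the neighbourhood contains enough independent constraints, so that the smallest eigenvalue is well separated and $p_4$ does not vanish.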

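The smallest eigenvalue $\lambda_4$ gives the residual of the attempted fit. This can be used 
to define a confidence measure $\Omega$ based on a threshold $\tau$ [?]: 

$$\Omega = \begin{cases} \dots & \text{if } \lambda_4 > \tau \\ \dots & \text{else} \end{cases} \qquad (8)$$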



It is quite possible that the neighbourhood does not contain enough information 
to compute a full flow estimate. This can be somewhat amended by using the 
intensity data as well; how this is done is described next. 



2.3 Including Intensity 

The usage of intensity data in addition to the range data can improve both the 
accuracy and density of the estimated range flow significantly [?]. We assume 
that the intensity does not change with moderate depth changes. Thus, like for 
optical flow, we attribute all changes in intensity to motion. This yields another 
constraint equation: 

$$0 = \partial_x I\, \dot{x} + \partial_y I\, \dot{y} + \partial_t I . \qquad (9)$$

Combined with (2) and (3) we obtain an additional constraint on $U$ and $V$: 

$$\frac{\partial(I,Y)}{\partial(x,y)}\, U + \frac{\partial(X,I)}{\partial(x,y)}\, V + \frac{\partial(X,Y,I)}{\partial(x,y,t)} = 0 . \qquad (10)$$

This can also be written as $d'^T p = 0$, where we simply set $d'_3 = 0$. The intensity 
constraint (10) results in another structure tensor $J'$ constructed following Eq. 
(6). The sum of the two tensors yields a combined tensor from which the solution 
is then found by the analysis described above. 

In order to ensure no a priori bias between intensity and depth we also 
require that the two data channels have been scaled such that their values are 
in the same order of magnitude. This can be done by subtracting the mean and 
adjusting the data to have the same variance. Additionally we can use a weight 
factor on the intensity tensor J' to account for different signal to noise ratios. 
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2.4 Dense Flow Fields 



Even the usage of intensity data does not ensure a unique solution in every local 
neighbourhood. Because of the need to compute derivatives of both the range 
data and range flow they are required to vary smoothly. We employ normalized 
averaging to obtain the required smoothness. This averaging is a special case of 
normalized convolution and is computed as follows [?]: 

$$\bar{f} = \frac{B * (\Omega f)}{B * \Omega} . \qquad (11)$$

$\Omega$ contains a confidence value for each estimated flow according to Eq. (8). 
The range data is smoothed in the same way, where the certainty of the range 
measurement is used as the confidence value. 
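
Normalized averaging is easy to demonstrate in one dimension. The sketch below is a simplified illustration, not the implementation used in the paper: it works on a 1D list instead of a 2D flow field, uses a 3-tap binomial mask with zero padding at the borders, and the function names are chosen for this example.

```python
def convolve_same(signal, mask):
    """1D convolution with zero padding; output has the length of the input.
    The mask is assumed to be symmetric."""
    half = len(mask) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, m in enumerate(mask):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += m * signal[j]
        out.append(acc)
    return out

def normalized_average(values, confidence, mask=(0.25, 0.5, 0.25)):
    """Smooth `values`, weighting every sample by its confidence."""
    num = convolve_same([c * v for c, v in zip(confidence, values)], mask)
    den = convolve_same(confidence, mask)
    return [n / d if d > 0 else 0.0 for n, d in zip(num, den)]

values = [1.0, 1.0, 99.0, 1.0, 1.0]        # one unreliable estimate
confidence = [1.0, 1.0, 0.0, 1.0, 1.0]     # ... with confidence zero
smoothed = normalized_average(values, confidence)
```

The unreliable sample does not influence the result at all; its position is filled in from the neighbours, which is how the sparse flow estimates become a dense field.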



3 Expansion Rates 

There are a number of applications where one wants to determine whether the 
movement and deformation of an observed surface changes its surface area. Ex- 
amples are the study of the effects of temperature changes or mechanical stress 
on materials and the observation of growth rates in biological systems. 

The surface area of a regular surface can be defined as the integral over the 
area of tangential parallelograms [?]: 

$$A = \iint |\partial_x r(x,y) \times \partial_y r(x,y)| \, dx\, dy . \qquad (12)$$

Hence we can define a local area element as $|\partial_x r \times \partial_y r|$. The relative change 
in the local surface area caused by the displacement vector $\Delta r = \Delta t\, f$ is then 
given by: 

$$dA = \frac{|\partial_x (r + \Delta r) \times \partial_y (r + \Delta r)|}{|\partial_x r \times \partial_y r|} . \qquad (13)$$

We are now ready to define the relative expansion rate as: 

$$e = (dA - 1) \cdot 100\,\% . \qquad (14)$$

This quantity can again be computed from derivatives, obtained via convolution, 
of both the range data and range flow arrays. 
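
For a single surface element, Eqs. (13) and (14) translate into a few lines of code. The sketch below assumes that the derivative vectors of the surface position and of the range flow at that element are already available; the function and parameter names are chosen for this illustration.

```python
import math

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))

def expansion_rate(rx, ry, fx, fy, dt=1.0):
    """Relative expansion rate in percent, following Eqs. (13) and (14).

    rx, ry: derivatives of the surface position r with respect to x and y
    fx, fy: derivatives of the range flow f with respect to x and y
    """
    moved_x = tuple(r + dt * f for r, f in zip(rx, fx))
    moved_y = tuple(r + dt * f for r, f in zip(ry, fy))
    dA = norm(cross(moved_x, moved_y)) / norm(cross(rx, ry))
    return (dA - 1.0) * 100.0
```

A rigid translation has a constant flow field, so both flow derivatives vanish and the expansion rate is zero. A uniform scaling of the surface by a factor of 1.1 corresponds to `fx = 0.1 * rx` and `fy = 0.1 * ry` and yields an expansion rate of 21 %.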

Because of the need to compute derivatives of both the range data and range 
flow they are required to vary smoothly and should be on the same scale. Thus 
we employ the same normalized averaging on the range data as on the range 
flow, see Sect. 2.4. 
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4 Experiments 

In the following all derivatives are computed using 5-tap derivative filters that 
have been designed to give an optimal direction estimation [?]. For the integration 
in the structure tensor, see Eq. (6), we use a $9 \times 9$ Binomial. The threshold 
on the smallest eigenvalue is set to $\tau = 0.1$ and the averaging in Eq. (11) is done by 
computing the second level of the Gaussian pyramid. 



4.1 Synthetic Data 

The synthetic data used here to test the introduced algorithm consists of an 
expanding sphere. The sensor is modelled as a pinhole camera, as is the case for 
structured lighting or stereo systems. The radius of the sphere is initially 150 mm 
and its centre is located at $(0, 0, 300)^T$ mm. The focal length is set to 20 mm 
and the sensor elements (CCD pixels) are spaced at 0.05 mm in both directions. We 
apply an intensity texture to each surface element based on the spherical angles, 
$I = I(\theta, \phi)$: 

$$I(\theta, \phi) = \begin{cases} 100 & \text{if } \theta < 0.5^\circ \\ 100 + 50 \sin(\dots) + 50 \sin(\dots) & \text{else} \end{cases}$$

Surface expansion is modelled by a change in radius. We use a sequence with an 
expansion rate of 1% per frame; this corresponds to multiplying the radius by 
a factor of 1.00499 between successive frames. Additionally the whole sphere is 
moving with a constant velocity of $(0.01, 0.02, 0.03)^T$ mm/frame. 
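
The stated factor can be checked with a short calculation, assuming only that the surface area of a sphere scales with the square of its radius ($R_n$ and $A_n$ denote radius and area in frame $n$; these symbols are introduced here for the check):

```latex
\frac{A_{n+1}}{A_n} = \left( \frac{R_{n+1}}{R_n} \right)^2 = 1.00499^2 \approx 1.0100
\quad \Longrightarrow \quad
e = (1.0100 - 1) \cdot 100\,\% \approx 1\,\% .
```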

Sensible values for the noise in the range data with a viewing distance around 
150 mm are given by $\sigma_Z = 0.1$ mm [?]. For the error in the $X$ and $Y$ coordinates we 
assume a noise around $\sigma_{X,Y} = 0.01$ mm, and the intensity noise lies in the range 
of $\sigma_I = 1.0$. 

The following table gives the mean values for the relative error in the expansion 
rate $E_e$, the relative error in the range flow magnitude $E_m$ and the mean 
absolute angle deviation between correct and estimated 3D velocity $E_d$: 



sigma_X,Y   sigma_Z   sigma_I   E_e [%]   E_m [%]   E_d [deg] 
0.00        0.0       0.0       1.02      0.001     0.01 
0.01        0.0       0.0       1.35      0.002     0.02 
0.00        0.1       0.0       1.10      0.001     0.01 
0.00        0.0       1.0       3.24      0.003     0.05 
0.02        0.0       0.0       2.03      0.003     0.04 
0.00        0.2       0.0       1.55      0.001     0.02 
0.00        0.0       2.0       6.32      0.004     0.10 
0.01        0.1       1.0       3.11      0.003     0.05 
0.02        0.2       2.0       6.89      0.005     0.10 


First we conclude that the range flow field is very accurate even for slightly 
higher noise levels. However, the estimation of expansion rates depends strongly 
on the noise. For a standard deviation of 0.1 mm in the Z coordinate we can 
compute accurate expansion rates. This accuracy can be achieved with current 
sensor technology. 

Fig. 1. Real castor bean leaf: a depth (Z) data, b U,V-component and c U,W-component 
of the range flow field, d intensity data. Growth in the range from -2% 
to 3% per hour: e as map and f as texture on the 3D structure. 

4.2 Leaf Growth 

An application that requires the evaluation of expansion rates is the study of 
growth in leaves. As an example we investigate a moving and growing castor 
bean leaf. The 3D structure is captured using a structured light range sensor. 
We sample the leaf every 2 minutes to obtain a sequence of range data sets. 
Figures 1a,d show the depth and intensity data for the frame where we compute 
range flow. The flow field is given in Fig. 1b,c. We see that there is considerable 
motion of the leaf, in particular in the Z direction. Clearly a lot of this motion does 
not stem from growth alone. The obtained growth rates in % per hour are given 
in Fig. 1e,f. In accordance with previous findings from leaves that have been 
confined to a plane, we see that growth diminishes as we move from the base to 
the tip of the leaf [?]. 

5 Conclusion 

We introduced a new version of the range flow motion constraint equation that 
can be evaluated directly on the sensor grid. In particular there is no need 
to fit a surface model to the range data. A formula for the computation of 
expansion rates from the thus computed range flow fields has been given. We 
demonstrated that our method is capable of computing accurate range flow 
and surface expansion rates on both synthetic and real data. 
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Abstract. This paper describes a new algorithm for illumination-invariant change 
detection that combines a simple multiplicative illumination model with decision 
theoretic approaches to change detection. The core of our algorithm is a new sta- 
tistical test for linear dependence (colinearity) of vectors observed in noise. This 
criterion can be employed for a significance test, but a considerable improvement 
of reliability for real-world image sequences is achieved if it is integrated into a 
Bayesian framework that exploits spatio-temporal contiguity and prior knowledge 
about shape and size of typical change detection masks. In the latter approach, 
an MRF-based prior model for the sought change masks can be applied success- 
fully. With this approach, spurious spot-like decision errors can be almost fully 
eliminated. 



1 Introduction 

In many computer vision applications, the detection and accurate delineation of moving 
objects forms an important first step. Many video surveillance systems, especially those 
employing a static or quasi-static camera, use processing algorithms that first identify 
regions where at least potentially a motion can be observed, before these regions are 
subject to further analysis steps, which might be, for instance, a quantitative analysis of 
motion. By this procedure, the available processing power of the system can be focused 
on the relevant subareas of the image plane. In applications such as traffic surveillance or 
video-based security systems, this focusing on the moving parts of the image typically 
yields a reduction of the image area to be processed in more detail to about 5-10 percent. 
Obviously, such a strategy is very advantageous compared to applying costly operations 
such as motion vector estimation to the total area of all images. 

For a change detection scheme to be successful, an utmost level of robustness 
against different kinds of disturbances in typical video data is required (a very low false 
alarm rate), whereas any truly relevant visual event should be detected and forwarded 
to more sophisticated analysis steps (high sensitivity). Obviously, this presents change 
detection as a typical problem that should be dealt with by statistical decision and 
detection theory. Some early papers [?] stress the importance of selecting the most 
efficient test statistic, which should be adapted both to the noise process and to a suitably 
chosen image model. A certain boost in performance was introduced in the early 
nineties by employing prior models for the typical shape and size of the objects to be 
detected; this can be achieved very elegantly using Gibbs-Markov random fields [?]. 
These models very strongly reduce the probability of false positive and false negative 
alarms due to the usage of spatio-temporal context. They are superior to any kind of 
change mask post-processing (e.g. by morphological filters), since both the shape and 
the strength of the observed signal anomaly are used. Such algorithms are by now 
widely accepted as state of the art and integrated in multimedia standard proposals such 
as [?]. 

Despite all these advancements, certain problematic situations remain in real-life 
video data, and possibly the most important ones are illumination changes. A rapid 
change in illumination does cause an objectively noticeable signal variation, both 
visually and numerically, but it is very often not regarded as a relevant event. Therefore, 
recent investigations [?] have put emphasis on the desired feature of illumination 
invariance. In contrast to earlier work, the present paper aims at integrating illumination 
invariance in a framework that is as far as possible based on decision theory and 
statistical image models. 

2 The Image Model, Including Illumination Changes and 
Superimposed Noise 

We model the recorded image brightness as the product of illumination and reflectances 
of the surfaces of the depicted objects. We furthermore assume that the illumination is 
typically a spatially slowly varying function (cf. [?]). 

For change detection, we compare for two subsequent images the grey levels which lie 
in a small sliding window*. Due to the spatial low-frequency behaviour of illumination, 
we can assume that illumination is almost constant in each small window. Thus, if 
no structural scene change occurs within the window, temporal differences between 
observed grey levels in the window can be caused only 

1. by a positive multiplicative factor k which modulates the signal and accounts for 
illumination variation 

2. and secondly by superimposed noise which can be modelled as i.i.d. Gaussian or 
Laplacian noise. 

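Let us consider the case that this null hypothesis $H_0$ is true: if we order the grey values 
from the regarded windows $W_1$ and $W_2$ into column vectors $x_1$ and $x_2 \in \mathbb{R}^N$, these are 
related by $x_1 = s + \varepsilon_1$ and $x_2 = k \cdot s + \varepsilon_2$, where $\varepsilon_i$, $i = 1, 2$, are additive noise vectors, 

$$E[\varepsilon_1] = E[\varepsilon_2] = 0, \quad \mathrm{Cov}[\varepsilon_1] = \mathrm{Cov}[\varepsilon_2] = \sigma^2 \cdot I_N \qquad (1)$$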

and $s$ is a signal vector. Unfortunately, the signal vector $s$ is unknown. In such situations 
it might at first glance appear reasonable to employ some kind of signal model 
(e.g. using cubic facets). However, considering the extremely variable local structure of 
real video scenes, especially for outdoor applications, it is certainly not advantageous 
to model the signal blocks as e.g. low-pass signals, since the algorithm would work 
worse in those locations where the signal model incidentally differs strongly from the true 
signal, such as in highly textured areas. In an ideal noise-free case, $x_1$ and $x_2$ are parallel 
given $H_0$. We thus formulate change detection as testing whether or not $x_1$ and $x_2$ can 
be regarded as degraded versions of colinear vectors, with factor $k$ and the true signal 
vector $s$ being unknown entities (so-called nuisance parameters). 

* or a fixed block raster, if detection and speed are of primary interest, and not so much the spatial 
accuracy of the detection masks 




Fig. 1. Geometrical interpretation of testing the colinearity of two vectors $x_1$, $x_2$.



2.1 Derivation of the Colinearity Test Statistic 

Earlier (and simpler) approaches to the task of testing the colinearity concentrated either
on the difference between the two observed vectors $x_i$ or on the angular difference
between the $x_i$. It is clearly not advisable to normalize both vectors prior to further
processing, since this irreversibly suppresses statistically significant information. With a
certain noise level $\sigma^2$ given, it makes a significant difference whether a certain angular
difference is found for 'long' or 'short' signal vectors - it is simply much easier to change
the direction of a 'short' signal vector. Therefore, basing the change detection decision
on the angle between the observed vectors (e.g. by using the normalized correlation
coefficient) cannot be recommended.

The approach we propose instead aims at preserving as much as possible of the
statistical information in the given data. Fig. 1 illustrates the derivation of the test statistic.
Given the observations $x_i$, $i = 1, 2$, and assuming i.i.d. Gaussian noise, a maximum
likelihood (ML) estimate of the true signal 'direction' (represented by a unit vector $u$)
is given by minimizing the sum $D^2 = |d_1|^2 + |d_2|^2$ of the squared distances $d_i$ of the
observed vectors $x_i$ to the axis given by vector $u$. Clearly, if $x_1$ and $x_2$ are colinear,
the difference vectors and hence the sum of their norms are zero^1. The projections $r_i$,
$i = 1, 2$, are ML estimates of the corresponding signal vectors. Obviously, we have

$$ |d_i|^2 = |x_i|^2 - |r_i|^2 \quad \text{for } i = 1, 2 $$
$$ |r_i| = \bigl|\, |x_i| \cdot \cos\psi_i \,\bigr| = |x_i^T u| \qquad (|u| = 1\,!) $$
$$ \Rightarrow \quad |d_i|^2 = |x_i|^2 - |x_i^T u|^2 $$
$$ \Rightarrow \quad D^2 = |d_1|^2 + |d_2|^2 = |x_1|^2 + |x_2|^2 - |x_1^T u|^2 - |x_2^T u|^2 $$

^1 Note that $s$, $ks$, $\varepsilon_1$, $\varepsilon_2$ are unknown entities, and $r_1$, $r_2$, $d_1$, $d_2$ are only their estimates.
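Let us now form the $2 \times N$ matrix $X$ with

$$ X = \begin{pmatrix} x_1^T \\ x_2^T \end{pmatrix} \quad \Rightarrow \quad |X u|^2 = u^T X^T X u = |x_1^T u|^2 + |x_2^T u|^2 $$

So it turns out that

$$ D^2 = |d_1|^2 + |d_2|^2 = |x_1|^2 + |x_2|^2 - u^T X^T X u $$

and the vector $u$ that minimizes $D^2$ is the same vector that maximizes

$$ u^T X^T X u \longrightarrow \max \quad \text{with } |u| = 1 $$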

which is obviously an eigenvalue problem^2 with respect to matrix $X^T X$. Due to the
special way it is constructed from just two vectors, $X^T X$ has maximum rank 2 and thus
has only two non-vanishing eigenvalues. We are only interested in the value of the test
statistic $D^2$, and fortunately it can be shown that $D^2$ is identical to the smallest non-zero
eigenvalue of matrix $X^T X$ (the two non-zero eigenvalues sum to the trace $|x_1|^2 + |x_2|^2$,
so subtracting the maximum of $u^T X^T X u$, i.e. the larger eigenvalue, leaves the smaller one).
Beyond that, it can be shown (e.g. quite illustratively by
using the singular value decomposition (SVD) of matrix $X$) that the non-zero eigenvalues
of $X^T X$ and $X X^T$ are identical. Thus, the sought eigenvalue is identical to the smaller
one of the two eigenvalues of the $2 \times 2$ matrix $X X^T$, which can be computed in closed
form without using iterative numerical techniques. So the minimum value for $D^2$ can be
determined without explicitly computing the ’signal direction unit vector’ u. This whole 
derivation is strongly related to the Total Least Squares (TLS) problem (cf. O) and 
matrix rank reduction tasks in modern estimation theory. 
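
The closed-form computation described above can be illustrated by a minimal Python sketch. It assumes that the two windows are given as equally long lists of grey values; the function name `colinearity_statistic` is chosen here for illustration only and does not appear in the original work.

```python
import math


def colinearity_statistic(x1, x2):
    """Return D^2, the smaller eigenvalue of the 2x2 matrix X X^T.

    x1, x2: grey values of the two windows, ordered into vectors
    (illustrative interface, not taken from the paper).
    """
    a = sum(v * v for v in x1)               # |x1|^2
    b = sum(v * v for v in x2)               # |x2|^2
    c = sum(p * q for p, q in zip(x1, x2))   # x1^T x2
    # Eigenvalues of [[a, c], [c, b]]: (a + b)/2 -/+ sqrt(((a - b)/2)^2 + c^2)
    half_trace = 0.5 * (a + b)
    root = math.sqrt((0.5 * (a - b)) ** 2 + c * c)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(half_trace - root, 0.0)
```

For exactly colinear vectors the statistic vanishes, while for two orthogonal unit vectors it equals 1, in agreement with $D^2 = |x_1|^2 + |x_2|^2 - \lambda_{\max}$.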



2.2 The Distribution of Test Statistic $D^2$

In order to construct a mathematically and statistically meaningful decision procedure,
the distribution of the test statistic $D^2$ must be known at least for the null hypothesis
$H_0$ (= colinearity). The asymptotic distribution of the test statistic $D^2$ can be derived
on the basis of some mild approximations. For the case that the norm of the signal
vector $s$ is significantly larger than the expected value for the noise vector norm (which
should hold true in almost all practical cases), it can be shown that the sum $|d_1|^2 + |d_2|^2$
is proportional to a $\chi^2$ variable with $N - 1$ degrees of freedom with a proportionality
factor $\sigma^2$ according to eq. (2).

$$ D^2 \sim \sigma^2 \cdot \chi^2_{N-1} \qquad (2) $$

^2 The matrix $X^T X$ can also be regarded as a (very coarse) estimate of the correlation matrix
between the vectors $x_i$, but this does not provide additional insight into the task of optimally
testing the colinearity.

This result can be intuitively understood: assuming that the probability density function
of the additive noise for each pixel is a zero-mean i.i.d. Gaussian with the same variance
for all $N$ pixels, the difference vectors $d_i$ reside in an $(N-1)$-dimensional subspace of
$\mathbb{R}^N$ which is determined by the direction vector $u$. If the length $|s|$ of the signal vector
is large, the direction $u$ is independent of the additive noise vectors $\varepsilon_i$. The components
of $d_i$ retain the property of being zero-mean Gaussians.

It might be surprising that the actual value of the multiplicative factor $k$ does not
influence the distribution of $D^2$, at least as long as the assumption of $|s| \gg |\varepsilon_i|$ holds.
This makes the decision invariant against (realistic) multiplicative illumination changes,
which is of course exactly what it has been developed for.

The distribution which has been theoretically derived and described above was also
exactly what we found in a set of Monte Carlo simulations for the test statistic $D^2$ for $N$
varying between 4 and 64 and factor $k$ varying between 0.2 and 5 (which corresponds,
in fact, already to a quite strong multiplicative change of illumination). Figures 2 and 3
show examples of the empirical distributions obtained by these simulations.
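
A simulation of this kind can be sketched in a few lines of Python. The sketch below is not the original experiment: the uniform signal range of 50 to 200 grey levels, the number of runs, and all function names are assumptions made for illustration. It only checks the first moment, since a $\chi^2$ variable with $N - 1$ degrees of freedom scaled by $\sigma^2$ has mean $\sigma^2 (N - 1)$.

```python
import math
import random


def d2(x1, x2):
    """Smaller eigenvalue of the 2x2 matrix X X^T (the test statistic D^2)."""
    a = sum(v * v for v in x1)
    b = sum(v * v for v in x2)
    c = sum(p * q for p, q in zip(x1, x2))
    root = math.sqrt((0.5 * (a - b)) ** 2 + c * c)
    return max(0.5 * (a + b) - root, 0.0)


def simulate_mean_d2(n_pixels, sigma, k, runs, seed=0):
    """Empirical mean of D^2 under H0 (x1 = s + e1, x2 = k*s + e2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        # Assumed signal model: strong signal compared to the noise level.
        s = [rng.uniform(50.0, 200.0) for _ in range(n_pixels)]
        x1 = [v + rng.gauss(0.0, sigma) for v in s]
        x2 = [k * v + rng.gauss(0.0, sigma) for v in s]
        total += d2(x1, x2)
    return total / runs
```

With $N = 16$ and $\sigma^2 = 1$ the empirical mean should lie close to 15 for any $k$ in the range considered, which is a quick plausibility check of the predicted degrees of freedom.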




Fig. 2. Empirical distribution of $D^2$ for $\sigma^2 = 1$,
$N = 16$, 100000 realizations. These empirical
distributions do not noticeably change when $k$
varies with $0.2 < k < 5$.

Fig. 3. Empirical distribution of $D^2$ for $\sigma^2 = 100$,
$N = 64$, 10000 realizations. Note the
conformity to the predicted scaling of $D^2 \sim \sigma^2$.



Testing the null hypothesis $H_0$ (= colinearity) can be expressed as testing whether
or not $D^2$ can be explained by the given noise model. On the basis of the now known
distribution of $D^2$ under the null hypothesis, a significance test can be designed, which
boils down to testing $D^2$ against a threshold $t$ which has been determined in such a way
that the conditional probability $\mathrm{Prob}[D^2 > t \mid H_0] = \alpha$ with the significance level $\alpha$.
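
As an illustration of how such a threshold could be obtained, the following sketch computes $t = \sigma^2 \cdot q_{1-\alpha}$, where $q_{1-\alpha}$ is the $(1-\alpha)$ quantile of the $\chi^2$ distribution with $N - 1$ degrees of freedom. To stay within the Python standard library, the quantile is approximated with the Wilson-Hilferty formula; this approximation is a substitution made here, not part of the described method, and an exact quantile routine would serve equally well. The function names are illustrative.

```python
from statistics import NormalDist


def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation of the chi-square quantile (assumed
    to be accurate enough for moderate dof; not an exact inverse CDF)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def significance_threshold(alpha, n_pixels, sigma2):
    """Threshold t with Prob[D^2 > t | H0] ~ alpha for D^2 ~ sigma2 * chi2(N-1)."""
    return sigma2 * chi2_quantile_wh(1.0 - alpha, n_pixels - 1)
```

For example, `significance_threshold(0.05, 16, 1.0)` gives a value close to 25, and the threshold scales linearly with the noise variance.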
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3 Integration of the Colinearity Test Statistic into a Bayesian 
MRF-Based Framework 



To improve the (already very good) performance of this test even further, we have
integrated the new test statistic into the Bayesian framework of earlier, illumination
sensitive change detection algorithms. The Bayesian approach draws its power
from using Gibbs/Markov random fields (MRF) for expressing the prior knowledge that
the objects or regions to be detected are mostly compactly shaped. For integrating the new
test, the conditional distribution $p(D^2 \mid H_1)$ of the test statistic $D^2$ under the alternative
hypothesis ($H_1$) has to be known at least coarsely. The resulting algorithm compares the
test statistic $D^2$ to an adaptive context dependent threshold, and is non-iterative. Thereby,
the new approach is an illumination-invariant generalization of the already very powerful
scheme 1 6181 which already was an improvement over the iterative proposal EH- 
Under the alternative hypothesis $H_1$ (vectors $x_1$ and $x_2$ are not colinear), we model
the conditional pdf $p(D^2 \mid H_1)$ by a Gaussian density of variance $\sigma_c^2$ (eq. (3))



with $\sigma_c^2 \gg \sigma^2$ (for details cf. [5, 8]). The assumption that this density is Gaussian does
not matter very much; it is just important that the variance $\sigma_c^2$ is significantly larger than
$\sigma^2$ and that the distribution is almost flat close to the origin. Furthermore, we model
the sought change masks by an MRF such that the detected changed regions tend to
be compact and smoothly shaped. From this model, a priori probabilities Prob(c) and
Prob(u) for the labels c (changed) and u (unchanged) can be obtained. The maximum
a posteriori (MAP) decision rule - given the labels in the neighbourhood of the regarded
block - is then

$$ \frac{p(D^2 \mid H_1)}{p(D^2 \mid H_0)} \;\underset{u}{\overset{c}{\gtrless}}\; \frac{\mathrm{Prob}(u)}{\mathrm{Prob}(c)} \qquad (4) $$

A little algebraic manipulation yields the context adaptive decision rule

$$ D^2 \;\underset{u}{\overset{c}{\gtrless}}\; T + (4 - \nu_c) \cdot B \qquad (5) $$

where $D^2$ is the introduced test statistic, and $T$ a fixed threshold
which is modulated by an integer number $\nu_c$. The parameter $\nu_c$
denotes the number of pixels that carry the label c and lie in the
$3 \times 3$-neighbourhood of the pixel to be processed (see figure).
These labels are known for those neighbouring pixels which
have already been processed while scanning the image raster
(causal neighbourhood), as symbolized by the gray shade in the
illustration.




For the pixels which are not yet processed we simply take the labels from the previous
change mask (anticausal neighbourhood). Clearly, the adaptive threshold on the right
hand side of (5) can only take the nine different values corresponding to $\nu_c = 0, 1, \ldots, 8$.
The parameter $B$ is a positive cost constant. The adaptive threshold hence is the lower, the higher the
number $\nu_c$ of adjacent pixels with label c. It is obvious that this behaviour favours the
emergence of smoothly shaped changed regions, and discourages noise-like decision
errors. The nine different possible values for the adaptive threshold can be precomputed
and stored in a look-up table, so this procedure needs only slightly more computational
effort than just applying a fixed threshold.
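
The look-up table idea can be sketched as follows. The sketch assumes that the adaptive threshold has the linear form $T + (4 - \nu_c) \cdot B$, which agrees with the description (nine values, decreasing with $\nu_c$) but should be read as an assumption here; the function names and the numeric values in the usage note are purely illustrative.

```python
def threshold_table(T, B):
    """Precompute the nine adaptive thresholds for nu_c = 0, ..., 8.

    Assumed form: T + (4 - nu_c) * B, i.e. the threshold drops as more
    neighbours already carry the label 'c' (changed).
    """
    return [T + (4 - nu_c) * B for nu_c in range(9)]


def classify(d2, nu_c, table):
    """Label a pixel 'c' (changed) or 'u' (unchanged) by one table look-up."""
    return "c" if d2 > table[nu_c] else "u"
```

With, say, `table = threshold_table(25.0, 2.0)`, the same statistic value of 20 is labelled unchanged in an unchanged neighbourhood (`nu_c = 0`) but changed when all eight neighbours are already labelled changed (`nu_c = 8`).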



4 Results 

Figure 4 shows some typical experimental results obtained by using the described
technique. In the image sequence used, there is true motion (a toy train) and a visually
and numerically strong illumination change obtained by waving a strong electric torch
across the scene. A comparison of image c), where a conventional change detection
technique is employed, with image d) (illumination invariant change detection) shows
the advantages of the new technique very clearly.




Fig. 4. a), b): Subsequent original frames from a sequence with moving toy trains. A beam of 
light crosses this scene quickly from left to right, c) Result of the illumination sensitive change 
detection algorithm in [Sih mixing illumination changes with the moving locomotives, d) Result 
of the new illumination invariant change detection. The locomotives have been safely detected, 
and all erroneous detection events due to illumination changes are suppressed. 
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5 Conclusions 

We consider the illumination-invariant change detection algorithm presented here as a 
significant step forward compared to earlier (already quite well performing) statistics- 
based approaches. For the near future, a comparison to competing approaches using 
homomorphic filtering (cf. Ill dl l remains to be performed. Furthermore, it appears to 
be very promising to extend the discussed approaches to change detection towards the 
integrated processing of more than two subsequent frames. We are convinced that - if again
statistical modeling and reasoning is employed - a further improvement compared to
the state of the art can be achieved.
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Abstract. Machine learning is a desirable property of computer vision 
systems. Especially in process monitoring, knowledge of temporal context
speeds up recognition. Moreover, memorizing earlier results allows one
to establish qualitative relations between the stages of a process. In
this contribution we present an architecture that learns different visual 
aspects of assemblies. It is organized hierarchically and stores prototypi- 
cal data from different levels of image processing and object recognition. 
An example underlines that this memory facilitates assembly recognition 
and recognizes structural relations among complex objects. 



1 Introduction and Related Work 

The use of computer vision in automatic manufacturing ranges from control,
supervision, and quality assessment to understanding of events in an assembly
cell [13]. Concerning intelligent man-machine interaction in robotic assembly the
latter is of special relevance. Usual robotic assembly is a step by step procedure
of building complex objects from simpler ones. The resulting artifacts are called
mechanical assemblies and are composed of subassemblies. As subassemblies may
as well consist of simpler units, objects resulting from construction processes
often show a hierarchical composition. Knowledge of the structure and temporal
evolution of assemblies makes it possible to deduce what is going on in a construction
cell. This, however, requires techniques for automatic recognition and structural
analysis as well as a facility to capture information derived at earlier stages of
construction.

This contribution presents an approach that copes with these requirements. 
We introduce a system that stores and relates results from visual assembly recog- 
nition. It realizes a learning mechanism which speeds up recognition and accu- 
mulates useful information throughout an assembly process. 

Over the last years, interest in visual assembly monitoring has increased 
significantly. Khawaja et al. ^ report on quality assessment based on geometric 
object models. Dealing with visual servoing Nelson et al. m apply feature based 

* This work has been supported by the DFG within SFB 360. 



B. Radig and S. Florczyk (Eds.): DAGM 2001, LNCS 2191, pp. 17S-|1^^ 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 



Memorizing Visual Knowledge for Assembly Process Monitoring 179 




(a) (b) 

Fig. 1. (a) Examples of objects and assemblies dealt with by our system, shown
with calculated positions of mating features. (b) Results of object recognition and
structural assembly analysis for this example.



object recognition to cope with misplaced parts. Miura and Ikeuchi cn, Tung 
and Kak cni, and Lloyd et al. 0 introduce methods to learn assembly sequences 
from observation. However, the latter contributions either are not purely based 
on vision or they only deal with simple block world objects. 

Abstracting from detailed geometry we presented syntactic methods to de- 
scribe assembled objects and to generate assembly plans from image data m 
Implemented as a semantic network, our method examines images of bolted
assemblies (see Fig. 1(a)) and yields hierarchical assembly structures (see Fig. 1(b)).
As assembly structures and assembly sequences are tightly related, we integrated
structural analysis and visual action detection to increase the reliability of
process monitoring. A module for event perception registers which objects are
taken from or put into the scene. Based on observed event sequences, it deduces what
kind of constructions are carried out. If a new cluster of objects appears in the
scene, it triggers a syntactical analysis to decide whether the cluster depicts an
assembly and corresponds to the recent sequence of events.



2 Concept of a Memory Storing Visual Cues 

Whenever the assembly detection component of our integrated system is called, 
it performs an analysis by means of a semantic network. This, in fact, is rather 
time consuming if it is done ignoring knowledge derived earlier. Kummert et 
al. [Z] proposed information propagation through time to speed up semantic 
object recognition. Due to the complex structure of assemblies their approach is 
hardly applicable to assembled objects. Furthermore, hierarchical descriptions of 
assemblies usually are not unique. This complicates assembly recognition since 
matching hierarchical descriptions is burdensome. In the following we propose a 
solution to both these problems. 
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Our method to detect assembly structures analyzes the topology of clusters 
of labeled image regions resulting from elementary object recognition Q. In 
order to determine details of object connections, positions of mating features
are calculated from the regions and their topological relations are examined [8].
Mating features are necessary to interconnect objects; in Fig. 1(a) calculated
positions of mating features are indicated by white crosses.

Experiments suggested that the appearance of a cluster of regions or the 2D
distribution of mating features are characteristic features of individual assemblies
and might be used for recognition.

This led to the idea of a database-like memory to store information from
different levels of image processing and associate it with corresponding assembly
structures. Its conceptual layout is sketched in Fig. 2. The memory consists of
several tables, called dimensions, storing data. There is an implicit hierarchy of
dimensions since higher level data is computed from lower level data. Each
dimension comes along with a specialized function to compare entities of its
content. Here our approach differs from classical databases: since visual
assembly recognition is a pattern recognition task, data which is to be stored in
the memory might be similar to, but usually not identical with, earlier stored
data. Thus we need sophisticated functions computing similarities instead of the key
comparisons known from databases. Data from different dimensions representing
the same assembly is connected via pointers so that it is possible to retrieve
information concerning an assembly given just a certain aspect. Note that the
comparison functions can be chosen freely. The concept sketched in Fig. 2 is modular and
facilitates the exchange of pattern recognition techniques.

Fig. 2. Concept of a memory to store and relate information for assembly
recognition. Data from different stages of processing is stored in different
tables $Dim_m$, $m \in \mathbb{N}$. A function $CompFunc_m$ is associated with each table to
compare corresponding datasets. Datasets from different tables that represent
the same assembly are related via pointers.

Based on this concept, assembly recognition is done as follows:

1. Compute the lowermost type of information from an image and check
whether similar information is contained in the lowest dimension. If this is the case,
trace the pointers to the highest dimension, report the information found
there and stop. If similar information cannot yet be found in the lowest
dimension, register it and continue with 2.

2. Continue image processing. If similar data can be found in the corresponding
hierarchical level, interconnect this entity and the recently registered data
from the previous level, trace the pointers to the highest dimension and
stop. If similar information has not yet been stored in the current dimension,
store it and draw pointers to the corresponding data in the level below.

3. Continue with 2. until the highest level of information is reached.
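
The three steps above can be sketched as a small data structure. This is an illustrative simplification and not the implemented system: the class and function names, the use of upward pointers only, and the idea of passing one hypothetical extractor function per dimension are all assumptions made for the example.

```python
class Entry:
    """One stored dataset; `up` points to the related entry one level higher."""

    def __init__(self, data):
        self.data = data
        self.up = None


class Dimension:
    """A table of the memory together with its comparison function."""

    def __init__(self, name, similar):
        self.name = name
        self.similar = similar   # plays the role of CompFunc_m
        self.entries = []

    def find(self, data):
        for entry in self.entries:
            if self.similar(entry.data, data):
                return entry
        return None

    def register(self, data):
        entry = Entry(data)
        self.entries.append(entry)
        return entry


def top_of(entry):
    """Trace the pointers up to the highest dimension."""
    while entry.up is not None:
        entry = entry.up
    return entry


def recognise(memory, extractors, image):
    """memory: Dimensions ordered from lowest to highest level.
    extractors: one (hypothetical) function per dimension that computes
    that level's data from the image. Returns (top-level data, known?)."""
    previous = None
    for dim, extract in zip(memory, extractors):
        data = extract(image)
        hit = dim.find(data)
        if hit is not None:
            if previous is not None:
                previous.up = hit        # link newly registered lower-level data
            return top_of(hit).data, True
        entry = dim.register(data)       # nothing similar stored yet
        if previous is not None:
            previous.up = entry
        previous = entry
    return previous.data, False
```

Because every dimension only needs a `similar` function, exchanging a comparison technique amounts to passing a different callable, which mirrors the modularity argued for in the text.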



3 Implementational Issues and a Performance Example 



In the actual implementation of the memory we realized 4 dimensions. These
are used to store clusters of regions, sets of 2D mating feature positions (called
interest points in the following), syntactic assembly structures, and names
assigned to an assembly by a user interacting with the system. To compare two
clusters of labeled regions, the number and types of objects comprised in the
clusters and the compactness and eccentricity of the merged regions are
considered. This is a rather coarse approach, basically suited to recognize assemblies
that do not change their position and orientation over time. More effort was
spent in matching interest points. Out of the many known approaches to point 
set matching (cf. e.g. II Oil I and the references therein) we chose two fast ones for 
our scenario: calculating the Hausdorff-distance between two sets and computing 
an affine transformation between point sets optimized by means of a gradient 
descent method. 

Besides its 2D image coordinates, each interest point is assigned the type of
mating feature it represents. Thus an interest point is a tuple $p = (x, t)$ with
$x \in \mathbb{R}^2$ and $t \in Type$ where $Type = \{BoltThread, BarHole, CubeHole, \ldots\}$.

Defining a distance $d(p, p') = \|x - x'\| + d_{Type}(t, t')$, where we choose
$d_{Type}(t, t') = 0$ if $t = t'$ and a fixed positive constant otherwise,
$\mathbb{R}^2 \times Type$ becomes a metric space and methods to measure distances between
sets of interest points are available. To determine the distance from a point set
$P_0$ to a point set $P_1$ by optimizing an affine transformation, the coordinates $x$
of all points in both sets are normalized to the unit square. Then two points
$p_1 = (x_1, t_1)$ and $p_2 = (x_2, t_2)$ are chosen from $P_0$ such that

$$ \|x_1 - x_2\| = \max_{p, p' \in P_0} \|x - x'\| . $$

Likewise a pair of points $(p'_1, p'_2)$ is chosen from $P_1$ with types corresponding
to $t_1$ and $t_2$. An affine operator $A$ mapping $(x'_1, x'_2)$ to $(x_1, x_2)$ is estimated
and applied to the coordinates of all points in $P_1$. Subsequently, we compute the
number of elements of equal types. To this end, the two sets $P_0$ and $P_1$ are projected
to the corresponding multi sets (cf. jHj) of types, i.e.
$P_i \to P_{Type,i} = \{ t_j \mid (x, t)_j \in P_i \wedge t_j = t \}$.
Given the multi set intersection $P_{Type} = P_{Type,0} \cap P_{Type,1}$, its size
$n = |P_{Type}|$ yields the number of elements of equal types in $P_0$ and $P_1$. Then
a bijective assignment of type equivalent points $(p_i, p'_i)$, $p_i \in P_0$, $p'_i \in P_1$,
$i \in \{1, \ldots, n\}$, is computed such that the distance $E$
is minimal. Afterwards $A = [a_{ij}]$ is iteratively updated according to
$a_{ij}(\tau + 1) = a_{ij}(\tau) - \partial E / \partial a_{ij}(\tau)$
until $E$ falls below a certain threshold or a maximum number of
iterations is reached. The two sets are called equivalent if E is smaller than the 
threshold. For details and variations of this method please refer to j^. 

The Hausdorff distance between two point sets $P_0$ and $P_1$ depends on a
method to measure the distance between individual points. Considering the
distance $d : (\mathbb{R}^2 \times Type) \times (\mathbb{R}^2 \times Type) \to \mathbb{R}$ as defined above, the Hausdorff
distance between $P_0$ and $P_1$ is given as $H(P_0, P_1) = \max(h(P_0, P_1), h(P_1, P_0))$
where

$$ h(P_0, P_1) = \max_{p \in P_0} \min_{p' \in P_1} d(p, p') . $$

To test the equivalence of two sets of interest points, the image coordinates $x$
of all points in the sets are transformed to principal axes coordinates and the
Hausdorff distance is computed. If it is smaller than a threshold estimated from
test samples, the sets are said to represent the same object. Again, for details 
please refer to 0. 
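
The point distance and the Hausdorff distance defined above translate directly into code. In the following sketch an interest point is represented as a pair `((x, y), type_name)`; the value of the type mismatch constant (`TYPE_MISMATCH_COST = 1.0`) and all function names are assumptions for illustration, and the transformation to principal axes coordinates is omitted.

```python
import math

TYPE_MISMATCH_COST = 1.0   # assumed value of d_Type(t, t') for t != t'


def point_distance(p, q):
    """d(p, p') = ||x - x'|| + d_Type(t, t') for points ((x, y), type)."""
    (x, t), (y, s) = p, q
    return math.dist(x, y) + (0.0 if t == s else TYPE_MISMATCH_COST)


def directed_hausdorff(P0, P1):
    """h(P0, P1) = max over p in P0 of min over p' in P1 of d(p, p')."""
    return max(min(point_distance(p, q) for q in P1) for p in P0)


def hausdorff(P0, P1):
    """H(P0, P1) = max(h(P0, P1), h(P1, P0))."""
    return max(directed_hausdorff(P0, P1), directed_hausdorff(P1, P0))
```

Two sets would then be regarded as equivalent if `hausdorff(P0, P1)` stays below a threshold, which, as stated above, has to be estimated from test samples.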

It turned out that the methods tended to yield different results when applied 
to the same data. We thus realized a simple majority voting scheme (cf. d) to 
increase the reliability of results from point set matching. 

As computing the distance between point sets by means of an affine operator
mainly considers image coordinates while computing the Hausdorff distance
considers image coordinates and mating feature types, we decided to compute
another cue that first of all regards point types. Given two point sets $P_0$ and
$P_1$ with sizes $n_0 = |P_0|$ and $n_1 = |P_1|$ we again look at their multi sets of types
and compute the corresponding intersection $P_{Type}$. Without loss of generality
let us assume that $n_0 \le n_1$. If $n = |P_{Type}| > 0.7\, n_0$ then $P_0$ and $P_1$ are said to
be equal.

Now there are enough cues for majority voting. Two sets of interest points
represent the same entity if at least two of the described methods vote
accordingly, i.e. if at least two methods yield that $P_0$ and $P_1$ are equivalent.
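
The type-based cue and the voting step can be sketched with the multiset type from the Python standard library. Points are again assumed to be pairs `((x, y), type_name)`; the function names are illustrative, and the two other votes (affine matching and Hausdorff distance) are simply passed in as booleans.

```python
from collections import Counter


def type_cue(P0, P1, ratio=0.7):
    """True if the multiset intersection of types covers more than
    `ratio` of the smaller point set (the 0.7 criterion of the text)."""
    types0 = Counter(t for _, t in P0)
    types1 = Counter(t for _, t in P1)
    n = sum((types0 & types1).values())   # size of the multiset intersection
    n_small = min(len(P0), len(P1))       # n_0 after the w.l.o.g. ordering
    return n > ratio * n_small


def majority_vote(votes, needed=2):
    """Sets are equivalent if at least `needed` methods vote accordingly."""
    return sum(bool(v) for v in votes) >= needed
```

A call such as `majority_vote([affine_ok, hausdorff_ok, type_cue(P0, P1)])`, with the first two flags supplied by the other matchers, then yields the final decision.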

As Fig. 3 indicates, matching interest points against earlier derived ones is
beneficial in terms of computational effort. The figure sketches the average time
needed to process images of assemblies comprising an increasing number of bolts,
i.e. images of increasing complexity. Obviously, the amount of time required for
syntactical assembly analysis grows exponentially. This is due to the fact
that syntactic methods examine local properties of patterns. In our case, local
adjacency relations within clusters of objects are analyzed to derive assembly
structures. The more adjacencies there are, the more derivations are possible. If a
chosen alternative fails to yield a good explanation (because many objects in the
cluster do not fit into the current structure), it has to be discarded and another
description must be considered. Interest point matching, in contrast, deals with
global properties of assemblies. It does not aim at explaining local relations
between parts but regards distances between sets of features of parts. If these
sets, as in our case, are of reasonable size (an assembly with e.g. nine bolts
typically has about 40 interest points), combinatorial explosions can be avoided.

Fig. 3. Computational costs for visual assembly processing.

Fig. 4. A series of images depicting the course of a construction and screen shots
indicating the internal state of the memory after monitoring this sequence.

In terms of recognition accuracy, voted point matching reaches an average of
82%. This seems a rather poor performance; however, user interaction can defuse
this problem. If neither region based matching nor interest point matching detects
a correspondence to a known assembly, a new syntactical description is computed
and the regions, the interest points, and the syntactic structure are related and
stored in the memory. A user can then assign a name to this structure. If the user
chooses a name already contained in the memory, the system has learned another
prototypical description of the respective assembly. The following discussion of
an exemplary interaction with the system should illustrate this mechanism.

Figure 4 depicts a series of construction steps and screenshots of the state
of the memory resulting from the process. Initially, the scene contained two
assemblies. The corresponding region clusters were computed and stored in the
memory. Sets of interest points were calculated and stored as well. Furthermore,
syntactical descriptions of both assemblies were generated and a user assigned
names to these structures. One was called HECK, the other was referred to as
COCKPIT. A series of mating operations resulted in the second scene. As the
HECK did not change its appearance in the image, no new information concerning
it was stored. The new assembly was called RUMPF and corresponding data was
registered. In the final scene HECK and RUMPF were rearranged and new clusters
of regions were stored in the memory. However, new sets of interest points did
not have to be stored since the system found correspondences to the already
known ones. The large window in the middle of Fig. 4 depicts the contents of
the memory. Clicking on the entry RUMPF caused all the corresponding facts to be
highlighted in dark grey. Moreover, as point set matching yielded that COCKPIT is 
a subassembly of RUMPF (cf. 0) all entries belonging to COCKPIT were highlighted 
in lighter grey. 

4 Summary 

This contribution described a system to store and associate different types of
information from visual assembly process monitoring. This memory is
multidimensional since it registers data from different levels of image processing. If new
information is drawn from image processing, it is stored and related to earlier
derived facts. In this way the memory dynamically learns prototypical features for
assembly recognition. It is scalable since new dimensions can easily be integrated
into the architecture, and it is modular because methods to compare data of a
certain dimension can be exchanged.

The implemented version of the memory facilitates intelligent man-machine
interaction in an assembly scenario. Using shape and point matching techniques
speeds up visual assembly recognition and allows the detection of assembly-subassembly
relationships. If the system fails to recognize an assembly properly by comparing
earlier derived data, a user can correct its conclusions and the system can use
this correction to further extend its knowledge.
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Abstract. In this article we present a new method for visual face 
tracking that is carried out in wavelet subspace. Firstly, a wavelet 
representation for the face template is created, which spans a low 
dimensional subspace of the image space. The wavelet representation 
of the face is a point in this wavelet subspace. The video sequence 
frames in which the face is tracked are orthogonally projected into 
this low-dimensional subspace. This can be done efficiently through a 
small number of local projections of the wavelet functions. All further 
computations are then performed in the low-dimensional subspace. 
The wavelet subspace inherits its invariance to rotation, scale and
translation from the wavelets; shear invariance can also be achieved,
which makes the subspace invariant to affine deformations.

Keywords: face tracking, wavelet decomposition, wavelet network 



1 Introduction 

In this paper we study how wavelet subspaces can be used to increase efficiency
in image computation for affine object tracking. A wavelet subspace is a
vector space that is dual to an image subspace spanned by a set of wavelets. Any
set of wavelets can be understood as spanning a subspace in the image space.
An image in image space can be orthogonally projected into the image subspace
and the vector of the wavelet coefficients defines a point in the dual wavelet
subspace. The image subspace (and consequently the wavelet subspace) may
be low dimensional. In order to establish tracking, the basic idea is to deform
the image subspace (or the wavelet subspace, respectively) to let it imitate the
affine deformation of the input image. When tracking is successful, the weight
vector in the wavelet subspace should be constant. To be precise, let us assume
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that at a certain time instance, tracking was successful and the object is indeed 
mapped onto a certain “correct” point in wavelet subspace. As the object moves
in space, its weak projection into image space undergoes a possibly affine
deformation. Clearly, the new affinely deformed image, if orthogonally projected
into the image subspace, will no longer map onto the same (correct) point in
the wavelet subspace. However, when the wavelet subspace undergoes the same
affine deformation as the image, then the image will map again onto the correct
point, i.e. tracking is successful.

Finding the correct deformation is in general a very complex operation. 
Therefore, instead of deforming the wavelet subspace directly, we deform the 
dual image subspace, that is the space spanned by the wavelets. The fact that 
rotation, dilation and translation are intrinsic parameters of wavelets further 
simplifies matters. Indeed, only a small set of local projections of the wavelets 
is needed to compute the unknown deformation parameters. 

We will use the notion of Wavelet Network (WN) in order to formalize the 
above ideas. WNs are a generalization of the Gabor Wavelet Networks, as intro- 
duced in PE). 

In section 2 we will give a short introduction to WNs and establish the
needed properties. In section 3 we will introduce our subspace tracking approach
and discuss the details. We conclude the paper with the experiments in section 4 and
concluding remarks.

In the remainder of this paper, we will limit the discussion to face tracking, as
faces appear to be of large research interest [ 512111711 )) . In |2| an efficient, general 
tracking framework is presented. That approach is more efficient than
ours, as it uses a discrete template for tracking. In our approach, we use
a continuous template, composed of continuous wavelets. Large scale changes
pose a major problem to the discrete approach, whereas they can be handled more
easily by the continuous wavelet template.

In IP a stochastic mean-shift approach is used for robust tracking. The
emphasis of this approach is robustness at the expense of precision.

In [Zj a stochastic approach for simultaneous tracking and verification has
been presented. The employed method uses sequential importance sampling
(SIS). As templates, discrete gray-value templates or bunch graphs are used.

In 1^ Gabor wavelets and bunch graphs were used for tracking faces. Because
of the large number of wavelets used, this approach ran at only
approximately 1 Hz.

In [5] a tracking method is presented that is based on Gabor wavelet networks.
The method presented here is a generalization of that tracking method and
outperforms it in terms of efficiency.



2 Introduction to Wavelet Networks 

Fig. 1. Face images reconstructed with different numbers of wavelets. As the
mother wavelet $\psi$, the odd Gabor function has been used.

To define a WN, we start out, generally speaking, by taking a family of $N$
wavelet functions $\Psi = \{\psi_{n_1}, \ldots, \psi_{n_N}\}$, where
$\psi_n(x) = \psi(SR(x + c))$, with $n = (c_x, c_y, \theta, s_x, s_y)^T$.
$\psi$ is called the mother wavelet and should be admissible. Here, $c_x, c_y$
denote the translation of the wavelet, $s_x, s_y$ denote the dilation and
$\theta$ denotes the orientation. The choice of $N$ is arbitrary and may be
related to the degree of desired representation precision of the network (see
fig. 1). In order to find a good WN for a function $f \in L^2(\mathbb{R}^2)$
($f$ dc-free, w.l.o.g.) the energy functional



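$$E = \min_{n_i, w_i \text{ for all } i} \Big\| f - \sum_i w_i \psi_{n_i} \Big\|_2^2 \qquad (1)$$

is minimized with respect to the weights $w_i$ and the wavelet parameter vectors
$n_i$. The two vectors $\Psi = (\psi_{n_1}, \ldots, \psi_{n_N})^T$ and
$w = (w_1, \ldots, w_N)^T$ then define the wavelet network $(\Psi, w)$ for
function $f$.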

From the optimized wavelets $\Psi$ and weights $w$ of the WN, the function $f$
can be (closely) reconstructed by a linear combination of the weighted wavelets:

$$f = \sum_{i=1}^{N} w_i \psi_{n_i} = \Psi^T w. \qquad (2)$$

This equation shows that the quality of the image representation and
reconstruction depends on the number $N$ of wavelets and can be varied to reach
almost any desired precision.



2.1 Direct Calculation of Weights 



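The weights $w_i$ of a WN are directly related to the local projections of the
wavelets $\psi_{n_i}$ onto the image. Wavelet functions are not necessarily
orthogonal, which implies that, for a given family $\Psi$ of wavelets, it is not
possible to calculate a weight $w_i$ by a simple projection of the wavelet
$\psi_{n_i}$ onto the image. In fact, a family of dual wavelets
$\tilde{\Psi} = \{\tilde{\psi}_{n_1}, \ldots, \tilde{\psi}_{n_N}\}$ has to be
considered. The wavelet $\tilde{\psi}_{n_j}$ is the dual wavelet of the wavelet
$\psi_{n_i}$ iff $\langle \psi_{n_i}, \tilde{\psi}_{n_j} \rangle = \delta_{i,j}$.
With $\tilde{\Psi} = (\tilde{\psi}_{n_1}, \ldots, \tilde{\psi}_{n_N})^T$, we can
write $\langle \Psi, \tilde{\Psi} \rangle = I$. In other words, given
$g \in L^2(\mathbb{R}^2)$ and a WN $\Psi = \{\psi_{n_1}, \ldots, \psi_{n_N}\}$,
the optimal weights for $g$ that minimize the energy functional are given by
$w_i = \langle g, \tilde{\psi}_{n_i} \rangle$. It can be shown that

$$\tilde{\psi}_{n_i} = \sum_j \left(\Psi^{-1}\right)_{i,j} \psi_{n_j}, \quad \text{where } \Psi_{i,j} = \langle \psi_{n_i}, \psi_{n_j} \rangle. \qquad (3)$$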
Fig. 2. A function $g \in L^2(\mathbb{R}^2)$ is mapped by the linear mapping
$\tilde{\Psi}$ onto the vector $w \in \mathbb{R}^N$ in the wavelet subspace. The
mapping of $w$ into $L^2(\mathbb{R}^2)$ is achieved with the linear mapping
$\Psi$. Both mappings constitute an orthogonal projection of a function
$g \in L^2(\mathbb{R}^2)$ into the image subspace
$\langle \Psi \rangle \subset L^2(\mathbb{R}^2)$.

3 Face Tracking in Wavelet Subspace 

The wavelet representation described in the previous section can be used
effectively for affine face tracking. Basically, this task is achieved by
affinely deforming a WN so that it matches the face image in each frame of a
video sequence. The affine deformation of a WN is carried out by considering the
entire wavelet network as a single wavelet, which is also called a superwavelet.
Let $(\Psi, w)$ be a WN with $\Psi = (\psi_{n_1}, \ldots, \psi_{n_N})^T$ and
$w = (w_1, \ldots, w_N)^T$. A superwavelet $\Psi_n$ is defined as a linear
combination of the wavelets $\psi_{n_i}$ such that

$$\Psi_n(x) = \sum_i w_i \psi_{n_i}(SR(x - c)), \qquad (4)$$

where the vector $n = (c_x, c_y, \theta, s_x, s_y, s_{xy})^T$ defines the
dilation matrix $S$, the rotation matrix $R$ and the translation vector $c$,
respectively. The affine face tracking is then achieved by deforming the
superwavelet $\Psi_n$ in each frame $J$, so that its parameters $n$ are
optimized with respect to the energy functional

$$E = \min_n \| J - \Psi_n \|_2^2. \qquad (5)$$

Clearly, this method performs a typical pixel-wise pattern matching in image 
space, where the template corresponds to the wavelet representation, which is 
affinely distorted to match the face in each frame. Obviously, the wavelet weights 
Wi are constant under the deformation of the template. Therefore, the affine 
deformation is captured only by the deformation of the wavelets, while the weight 
vector remains invariant. 
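
The deformation of the template as a whole, with constant weights, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the odd Gabor function is taken to be a Gaussian envelope times a sine, the shear parameter is left out, every wavelet uses the same "(x - c)" convention, and all function names and numbers are hypothetical.

```python
import math

def odd_gabor(x, y):
    # Assumed mother wavelet: Gaussian envelope modulated by sin(x).
    return math.exp(-0.5 * (x * x + y * y)) * math.sin(x)

def affine(px, py, n):
    # Map a point through S R (x - c) for n = (cx, cy, theta, sx, sy).
    cx, cy, theta, sx, sy = n
    dx, dy = px - cx, py - cy
    rx = math.cos(theta) * dx - math.sin(theta) * dy
    ry = math.sin(theta) * dx + math.cos(theta) * dy
    return sx * rx, sy * ry

def network(px, py, params, weights):
    # Weighted sum of the individual wavelets at the point (px, py).
    total = 0.0
    for w, n_i in zip(weights, params):
        u, v = affine(px, py, n_i)
        total += w * odd_gabor(u, v)
    return total

def superwavelet(px, py, params, weights, n):
    # Deform the whole network at once: evaluate it at S R (x - c).
    # The weights are untouched; only the sampling position changes.
    u, v = affine(px, py, n)
    return network(u, v, params, weights)

# Toy network with two wavelets (hypothetical parameters and weights).
params = [(0.0, 0.0, 0.0, 1.0, 1.0), (1.0, -0.5, 0.3, 2.0, 1.5)]
weights = [0.7, -1.2]

# A pure translation of the superwavelet by c = (3, -1) moves the template:
template = network(0.4, 0.2, params, weights)
moved = superwavelet(3.4, -0.8, params, weights, (3.0, -1.0, 0.0, 1.0, 1.0))
```

In this toy setting the value of the translated superwavelet at the shifted point equals the value of the undeformed template at the original point, which is the sense in which the weight vector stays invariant.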

We thus claim that the “tracking in image space” described above may also be
achieved in the wavelet subspace $\mathbb{R}^N$, which is isomorphic to the
image subspace $\langle \Psi \rangle$, as illustrated by fig. 2. As can be seen
there, both spaces are related through the matrices $\Psi$ and $\tilde{\Psi}$.
As the first step, consider a WN $(\Psi, v)$ that is optimized for a certain
face image. As previously mentioned, the optimal weight vector $v$ is obtained
by an orthogonal projection of the facial image into the closed linear span of
$\Psi$. Hence, we say that the face template was mapped into the weights
$v \in \mathbb{R}^N$, which we will call reference weights.

We mentioned before that the wavelet template gets affinely deformed in
order to track in image space. Analogously, the tracking in wavelet subspace
is performed by affinely deforming the subspace $\langle \Psi \rangle$, until
the weight vector $w \in \mathbb{R}^N$, obtained by the orthogonal mapping of
the current frame into this




^ To include the shear in the parameter set of the wavelets, see |S]. 
subspace, is closest to the reference weight vector $v$. In fact, this procedure
performs roughly the same pattern matching as before, but this is done efficiently
in the low-dimensional wavelet subspace $\mathbb{R}^N$.

The mapping of images into $\mathbb{R}^N$ is carried out with low computational
cost through a small number of local filtrations with the wavelets. Recall from
section 2.1 that $w_i = \langle f, \tilde{\psi}_{n_i} \rangle$, where
$\tilde{\psi}_{n_i} = \sum_j (\Psi^{-1})_{i,j} \psi_{n_j}$. This is equal to the
following equation:

$$w_i = \sum_j \left(\Psi^{-1}\right)_{i,j} \langle f, \psi_{n_j} \rangle. \qquad (6)$$

Thus, the optimal weights $w_i$ are derived from a linear combination of wavelet
filtrations, where the coefficients are given by the inverse of the matrix
$\Psi_{i,j} = \langle \psi_{n_i}, \psi_{n_j} \rangle$. It can be shown that the
matrix $\Psi_{i,j}$ is, except for a scalar factor, invariant with respect to
affine transformations of the wavelet network. It can therefore be computed
off-line and beforehand. Hence, the weights $w_i$ are computed efficiently with
eq. (6) through a local application $\langle f, \psi_{n_i} \rangle$ of each of
the $N$ wavelets $\psi_{n_i}$, followed by an $N \times N$ matrix multiplication.
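
As a rough illustration of eq. (6), the following Python sketch computes the weights for sampled (discretized) wavelets from the filter responses and the matrix of pairwise scalar products. It solves the linear system instead of forming the inverse explicitly, which gives the same weights. The helper names and the toy vectors are assumptions made for this example only.

```python
def dot(a, b):
    # Discrete scalar product of two sampled functions.
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def network_weights(wavelets, image):
    # Matrix of pairwise scalar products <psi_i, psi_j>.
    gram = [[dot(a, b) for b in wavelets] for a in wavelets]
    # Local filter responses <f, psi_j>.
    responses = [dot(image, psi) for psi in wavelets]
    # Weights w with gram * w = responses, i.e. w = gram^{-1} * responses.
    return solve(gram, responses)

# Two non-orthogonal toy "wavelets" sampled at four points.
wavelets = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
# Image = 2 * first + 3 * second wavelet, plus a part outside their span.
image = [5.0, 3.0, 7.0, 0.0]

weights = network_weights(wavelets, image)            # dual-wavelet weights
naive = [dot(image, psi) for psi in wavelets]         # plain projections
```

For these toy vectors the plain projections do not recover the coefficients 2 and 3, while the weights obtained through the matrix of scalar products do; the component of the image outside the span of the wavelets has no influence.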

Let $n = (c_x, c_y, \theta, s_x, s_y, s_{xy})$ be an affine parameter vector
which configures a parameterization for the subspace $\langle \Psi \rangle$. As
we described before, tracking in wavelet subspace is achieved by gradually
changing these parameters until the projection $w \in \mathbb{R}^N$ of the
current frame $J$ is closest to the reference weights $v$. In other words, we
must optimize the parameters $n$ with respect to the energy functional:

$$E = \min_n \| v - w \|_\Psi \quad \text{with} \quad w_i = \frac{1}{s_x s_y} \sum_j \left(\Psi^{-1}\right)_{i,j} \langle J, \psi_{n_j}(SR(x - c)) \rangle, \qquad (7)$$

where $S$ is the dilation matrix, $R$ is the rotation matrix and $c$ is the
translation vector, all defined through the vector
$(c_x, c_y, \theta, s_x, s_y, s_{xy})$. During tracking, this optimization is
done for each successive frame. As there is not much difference between
adjacent frames, the optimization is fast. To minimize the energy functional
(7), the Levenberg-Marquardt algorithm was used. For this, the continuous
derivatives of functional (7) with respect to each degree of freedom were
computed.

In equation (7) the notation $\| v - w \|_\Psi$ refers to the distance between
vectors $v$ and $w$ in the wavelet subspace $\mathbb{R}^N$. We define the
difference $\| v - w \|_\Psi$ as the Euclidean distance between the two
corresponding images $f$ and $g$:

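$$\| v - w \|_\Psi = \Big\| \sum_{i=1}^{N} v_i \psi_{n_i} - \sum_{j=1}^{N} w_j \psi_{n_j} \Big\|_2. \qquad (8)$$

Various transformations lead to

$$\| v - w \|_\Psi = \sqrt{\sum_{i,j} (v_i - w_i)(v_j - w_j) \langle \psi_{n_i}, \psi_{n_j} \rangle} = \sqrt{(v - w)^T \, \Psi_{i,j} \, (v - w)}. \qquad (9)$$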



The matrix of pairwise scalar products
$\Psi_{i,j} = \langle \psi_{n_i}, \psi_{n_j} \rangle$ is the same matrix as
the one in section 2.1 and eq. (6). Note that if the wavelets $\{\psi_{n_i}\}$
were orthogonal, then $\Psi_{i,j}$ would be the unity matrix and eq. (9) would
describe the same
Fig. 3. Sample frames of our wavelet subspace tracking. Note that the tracking method
is robust to variations in facial expression as well as affine deformations of the face
image.



distance measure as proposed in m- Compared to tracking in image space, the 
method presented here offers a considerable improvement in terms of efficiency,
as it provides a great data reduction, considering that tracking is performed
in the low-dimensional wavelet subspace. Moreover, it avoids the
computationally demanding template reconstruction and pixel-wise
sum-of-squared-difference computation required in image-based tracking.
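
The equivalence between the distance of eq. (9), evaluated purely on weight vectors, and the Euclidean distance between the two reconstructed images can be checked numerically. The following Python sketch does this for sampled toy wavelets; the function names and the numbers are illustrative assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reconstruct(wavelets, weights):
    # Image corresponding to a weight vector: sum_i w_i * psi_i.
    size = len(wavelets[0])
    return [sum(w * psi[k] for w, psi in zip(weights, wavelets))
            for k in range(size)]

def subspace_distance(v, w, gram):
    # sqrt((v - w)^T G (v - w)) with G the matrix of scalar products.
    d = [a - b for a, b in zip(v, w)]
    n = len(d)
    return math.sqrt(sum(d[i] * gram[i][j] * d[j]
                         for i in range(n) for j in range(n)))

def image_distance(v, w, wavelets):
    # Euclidean distance between the two reconstructed images.
    f = reconstruct(wavelets, v)
    g = reconstruct(wavelets, w)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

wavelets = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # non-orthogonal toy wavelets
gram = [[dot(a, b) for b in wavelets] for a in wavelets]
v = [2.0, 3.0]                                   # reference weights
w = [1.0, 1.0]                                   # current weights

d_subspace = subspace_distance(v, w, gram)
d_image = image_distance(v, w, wavelets)
```

Both values agree, but the first one only touches N-dimensional vectors and a precomputed N x N matrix, so no image has to be reconstructed.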



4 Experiments 

The proposed approach for affine face tracking was successfully tested on various
image sequences. All test sequences showed a person in motion, more or less
frontal to the camera, so that the facial features were always visible. The face
undergoes slight out-of-plane rotation, in-plane rotation, scaling and gesture
changes. The experiments were carried out offline; see fig. 3 for sample images.²

On the face of the person, a WN with 116 wavelets was optimized and used
to estimate the “ground truth” affine parameters in each frame. To analyze
how tracking precision is related to the number $N$ of wavelets used, we used
different WNs and chose the largest 51, 22 and 9 wavelets, sorted according to
decreasing normalized weights. The graphs in fig. 4 depict the estimation of the
face parameters x-position, y-position and angle $\theta$ as well as the ground
truth in each frame. They show how tracking precision increases with the number
$N$ of wavelets. $N$ can be chosen depending on the task and can be changed
dynamically. The tracked inner-face region had a size of 50x65 pixels. Using
only 9 wavelets, the computing time for each Levenberg-Marquardt cycle was 15 ms
on a 1 GHz Linux Athlon. Higher performance is achieved for smaller face regions
or fewer parameters (e.g. just translation and rotation).

In comparison to the method in [5] we observed in our experiments a speedup
of a factor of two for each single Levenberg-Marquardt cycle, but with a slight
increase in the number of needed cycles. However, we do believe that the use 
of other algorithms, such as the Condensation method 0 or the Sequential 
Importance Sampling (SIS) method 0, will increase efficiency. 



² See http://www.ime.usp.br/~rferis to view one of the test sequences.






[Plots omitted: three columns of panels for WNs with 9, 22 and 51 wavelets, each
plotting an estimated parameter together with the ground truth over frames 1 to 393.]

Fig. 4. Estimation of the face parameters x-position, y-position and angle $\theta$ in each
frame, using WNs with 9, 22 and 51 wavelets. The ground truth is depicted to illustrate
the decrease of precision when considering few wavelets.



5 Conclusions 

In this paper we have presented a tracking method that is carried out in wavelet
subspace. Since wavelets are invariant to affine deformations, they leave the
reference vector of a template face constant. Furthermore, the direct relationship
between the wavelet coefficients and the wavelet filter responses, which is widely
used for multi-resolution analysis and for motion estimation, allows an image to
be mapped into the low-dimensional wavelet subspace $\mathbb{R}^N$, where all
subsequent computations can be carried out.

The method has the further advantage that its precision and computation
time can be adapted to the needs of a given task. When fast and only
approximate tracking is needed, a small number of filtrations is usually sufficient to
realize tracking. When high precision tracking is needed, the number of wavelets 
can be gradually increased. This implies on the one hand more filtrations, but en- 
sures on the other hand a higher precision, as we have shown in the experimental 
section. 

Many tracking algorithms have difficulties dealing with large scale
changes. We think that the use of continuous wavelets and the possibility of
dynamically changing the number of wavelets used could simplify matters.

So far, the optimization of the wavelet networks has been done with a
Levenberg-Marquardt algorithm. We think that a combination with Condensation
or Sequential Importance Sampling (SIS) could help to further decrease
computational complexity. This will be evaluated in future work.
Abstract. We present a new technique for estimating the sea surface
heat flux from infrared image sequences. Based on solving an extension
to the standard brightness change constraint equation in a total least
squares (TLS) sense, the total derivative of the sea surface temperature
with respect to time is obtained. Due to inevitable reflexes in field data,
the TLS framework was further extended to a robust estimation based
on a Least Median of Squares Orthogonal Distances (LMSOD) scheme.
With this it is possible for the first time to compute accurate heat flux
densities at a high temporal and spatial resolution. Results obtained at
the Heidelberg Aeolotron showed excellent agreement with ground truth,
and field data was obtained during the GasExII experiment.



1 Introduction 

The net sea surface heat flux j is a crucial parameter for quantitative measure- 
ments of air-sea gas exchange rates, as well as for climate models and simulations. 
Meteorological estimates currently employed to measure j have poor spatial res- 
olution and need to average for several minutes to produce results. 

There exists strong experimental evidence that surface renewal is an
adequate model for heat transfer at the air-sea interface (uni, 0). In this model
a fluid parcel is transported from the bulk to the surface, where it equilibrates by
a mechanism akin to molecular diffusion. It has been shown that, based on this
model assumption, it is feasible to compute j by measuring the temperature
difference $\Delta T$ across the thermal sublayer and its total derivative with respect
to time, $d/dt\,\Delta T$ 0. $\Delta T$ can be derived from infrared images alone 0. In the
present work we will show how the total derivative can be estimated from IR
image sequences, so that heat flux measurements can be made at a high spatial
resolution and at the frame rate of the IR camera.
Fig. 1. A typical sequence as taken with an infrared camera at 100 Hz. While the
reflexes in region a are quite easy to detect, the reflexes in region b and in the top left
corner of the images are hard to detect in the still frames. However, they introduce
significant errors in the estimation of the heat flux density j.



In an IR camera the sea surface temperature (SST) is mapped as a grey
value g onto the image plane. As the SST changes due to physical processes,
brightness conservation does not hold. Therefore the brightness change
constraint equation (BCCE) commonly used in optical flow computations has to be
extended to be applicable in the context of this work. The extended BCCE can
then be solved in a local neighborhood in a total least squares (TLS) sense akin
to the structure tensor. This is shown in section 2.1. Due to reflections in the
image data from field experiments the proposed model of the extended BCCE
is violated. This problem can be overcome by embedding the TLS framework
in a robust estimator. In the context of this work a random sampling approach
was chosen that is based on the Least Median of Squares Orthogonal Distances
(LMSOD) scheme p. This will be outlined in section 3. In section 4 the actual
results of the estimate of j with the proposed algorithm are displayed.

2 The Total Derivative from Image Sequences 

The total derivative of $\Delta T$ with respect to time is readily given by

$$\frac{d}{dt} \Delta T = \frac{\partial}{\partial t} \Delta T + (u \nabla) \Delta T, \qquad (1)$$

where $u = [u_1, u_2, u_3]^T$ is the three dimensional velocity of the fluid parcel.

To determine $d/dt\,\Delta T$ of a fluid parcel in equation (1) by digital image
processing, it is not sufficient to extract the change of temperature
$\partial T/\partial t$ at a fixed position of the image, which is quite trivial.
In addition, an exact estimate of the velocity $u$ has to be found. In the
following sections a means for deriving $d/dt\,\Delta T$ directly from an image
sequence will be presented.

2.1 The Extended Brightness Model 

A very common assumption in optical flow computations is the BCCE p. It
states that the image brightness $g(x, t)$ at the location $x = (x_1, x_2)^T$ should
change only due to motion 0 , that is, the total derivative of its brightness has
to vanish:

$$\frac{dg}{dt} = \frac{\partial g}{\partial t} + \frac{\partial g}{\partial x}\frac{dx}{dt} + \frac{\partial g}{\partial y}\frac{dy}{dt} = g_t + (f \nabla) g = 0, \qquad (2)$$

with the optical flow $f = (dx/dt, dy/dt)^T = (u, v)^T$, the spatial gradient
$\nabla g$ and the partial time derivative $g_t = \partial g/\partial t$.

When using an infrared camera for acquiring image sequences, the temperature
$T$ of a three dimensional object is mapped as a grey value $g$ onto the
image plane. Through camera calibration the parameters of this transformation
are known. From comparing the two equations (1) and (2) it is evident that
the BCCE does not hold in the context of this work. In order to satisfy this
constraint, the temperature change we seek to measure would have to be equal
to zero.



It is known that the BCCE can be generalized to incorporate linear and
nonlinear brightness changes based on the differential equation of the physical
processes involved 0. Thus, the BCCE is generalized by adding a linear term $c$:

$$g_t + (f \nabla) g - c = (g_x, g_y, -1, g_t) \cdot (u, v, c, 1)^T = d^T p = 0, \qquad (3)$$

where $c$ is a constant, which is proportional to $d/dt\,\Delta T$. The data vector
$d = (g_x, g_y, -1, g_t)^T$ is given by the partial derivatives of the image data,
which are denoted by subscripts. Equation (3) poses an underdetermined system of
equations, as there is only one constraint with the three unknowns $f = (u, v)^T$
and $c$. Assuming constant optical flow $f$ over a small spatio-temporal
neighborhood surrounding the location of interest containing $n$ pixels, the problem
consists of $n$ equations of the form of equation (3). With the data matrix
$D = (d_1, \ldots, d_n)^T$ the TLS problem can be reformulated as an extension of
the structure tensor |^, that is

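$$\| D p \|_2^2 = p^T D^T D p = p^T F p \rightarrow \min, \qquad (4)$$

with $p^T p = 1$ to avoid the trivial solution $p = 0$. The eigenvector
$e = (e_1, e_2, e_3, e_4)^T$ to the smallest eigenvalue $\lambda$ of the
generalized structure tensor

$$F = D^T D = \begin{pmatrix}
\langle g_x g_x \rangle & \langle g_x g_y \rangle & -\langle g_x \rangle & \langle g_x g_t \rangle \\
\langle g_y g_x \rangle & \langle g_y g_y \rangle & -\langle g_y \rangle & \langle g_y g_t \rangle \\
-\langle g_x \rangle & -\langle g_y \rangle & \langle 1 \rangle & -\langle g_t \rangle \\
\langle g_t g_x \rangle & \langle g_t g_y \rangle & -\langle g_t \rangle & \langle g_t g_t \rangle
\end{pmatrix} \qquad (5)$$

represents the sought-after solution to the problem (4). In this notation local
averaging using a binomial filter is represented by $\langle \cdot \rangle$. In
the case of full flow, that is, with no aperture problem present, the parameter
vector $p$ is given by $p = 1/e_4 \, (e_1, e_2, e_3)^T$. Due to the inherent
structure of the infrared image sequences an aperture problem seldom exists and
only full flow needs to be considered.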
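
The estimation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the local binomial averaging is replaced by a plain sum over the neighborhood (which scales the tensor but leaves its eigenvectors unchanged), the eigenvector to the smallest eigenvalue is found by a simple power iteration on a shifted matrix rather than by a general eigenvalue solver, and the data are synthetic and noise-free. All names are hypothetical.

```python
def structure_tensor(rows):
    # Sum of outer products d d^T over the neighborhood,
    # with data vectors d = (g_x, g_y, -1, g_t).
    return [[sum(d[i] * d[j] for d in rows) for j in range(4)]
            for i in range(4)]

def smallest_eigenvector(F, iterations=2000):
    # Power iteration on B = trace(F) * I - F. Since F is symmetric and
    # positive semi-definite, the dominant eigenvector of B is the
    # eigenvector of F that belongs to its smallest eigenvalue.
    # The start vector must not be orthogonal to that eigenvector.
    n = len(F)
    shift = sum(F[i][i] for i in range(n))
    B = [[(shift if i == j else 0.0) - F[i][j] for j in range(n)]
         for i in range(n)]
    e = [1.0] * n
    for _ in range(iterations):
        e = [sum(B[i][j] * e[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in e) ** 0.5
        e = [x / norm for x in e]
    return e

def estimate_parameters(rows):
    # Normalise the eigenvector by its last component: (u, v, c).
    e = smallest_eigenvector(structure_tensor(rows))
    return e[0] / e[3], e[1] / e[3], e[2] / e[3]

# Synthetic neighborhood: gradients on a small grid, temporal derivative
# chosen so that g_x*u + g_y*v - c + g_t = 0 holds exactly.
u_true, v_true, c_true = 0.5, -0.25, 0.1
rows = []
for gx in (-1.0, 0.0, 1.0, 2.0):
    for gy in (-2.0, 0.0, 1.0):
        gt = c_true - gx * u_true - gy * v_true
        rows.append((gx, gy, -1.0, gt))

u, v, c = estimate_parameters(rows)
```

With noise-free data the recovered values of u, v and c match the ones used to generate the neighborhood; with real image data the smallest eigenvalue is no longer zero and indicates how well the model fits.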
Fig. 2. An illustration of the LMSOD estimator. The solid line representing the correct
estimate (connecting points A and B) has a much smaller sum of squared residuals than
the wrong estimate (the dotted line connecting points C and D). The wrong estimate
is therefore rejected.



3 Parameter Estimation in a Robust Framework 

In controlled laboratory experiments reflexes on the water surface from objects
of a different temperature can be greatly suppressed, as is the case in the
dedicated Heidelberg wind wave facility, the Aeolotron. However, in field
experiments reflexes stemming from different sources are inevitable. These reflexes
vary greatly in appearance. They can, however, be viewed as non-Gaussian noise
on the image data. TLS approaches exhibit very favorable properties in the
case of Gaussian noise but are very susceptible to outliers in the data. In the
case of reflexes, the proportion of outliers in the data can be as high
as 50%. If undetected, the reflexes are bound to introduce significant errors in
the estimation of the net heat flux, often rendering results useless. Therefore
their detection is paramount. It would of course be favorable if the data at the
reflexes were not simply discarded, but could be corrected and used in
subsequent data analysis.

The parameter estimation based on TLS is made robust by means of
LMSOD fp. This is a random sampling scheme where a number m of subsamples
are drawn, for each of which equation (0) is solved exactly. The subsample for
which the median of the residuals to the rest of the data set is smallest is
chosen as a preliminary estimate. From this estimate the inliers are selected
and a TLS is performed. An illustration of this algorithm is displayed in figure 2.

In more detail the algorithm can be described as follows: First, a subsample
$D_J = (d_1, \ldots, d_k)^T$ of $k$ observations is drawn, where $k$ is equal to
the number of parameters to be estimated, that is $k = 4$ for equation O. From
such a subsample $D_J$, an exact solution for the parameters can be found, which
is equal to solving a linear system of $k$ equations. The result is the trial
estimate vector $p_J$. It is of course very likely that these trial estimates
will stray far from the sought-after estimate. Therefore, the residual for the
trial estimate is calculated from every observation $i \in \{1, \ldots, n - k\}$,
leaving out the subsample $J$, that is $r_i = d_i^T p_J$. These trial residuals
make up the residual vector $r_J = (r_1, \ldots, r_{n-k})^T$ for a given
subsample $D_J$. In a next step the median of $r_J$ is computed. The whole
process is repeated for a number of subsamples $J \in \{1, \ldots, m\}$ and the
trial estimate with the minimal median of $r_J$ is retained. Central to this
estimator is the question of how many subsamples one has to draw to converge to
the right solution. A number $m$ of random subsamples has to be drawn such that
the probability of at least one of the $m$ subsamples equating to the right
estimate is almost 1. Assuming that $n/p$ is a large number, this probability
$P$ is given by

$$P = 1 - \left(1 - (1 - \epsilon)^p\right)^m, \qquad (6)$$

where $\epsilon$ is the fraction of contaminated data HH. This equation shows that
the number of subsamples that have to be drawn is significantly smaller than
when sampling every possible combination of points, which is given by
$m = n!/((n - p)! \cdot p!)$.

The LMSOD estimator has one debilitating drawback, namely its lack of
efficiency ($n^{-1/3}$ convergence), which is exactly the big advantage of
maximum likelihood estimators like the TLS. Therefore the LMSOD estimate is
used to find inliers on which the TLS estimator is then applied.
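
How few subsamples are needed compared to exhaustive sampling can be illustrated with a short Python sketch. It assumes the commonly used relation for random-sampling estimators, P = 1 - (1 - (1 - eps)^p)^m, where eps is the fraction of contaminated data and p the subsample size, and solves it for m. The function name and the example numbers are assumptions for illustration.

```python
import math

def subsamples_needed(probability, contamination, p):
    # Probability that one subsample of size p contains only inliers.
    clean = (1.0 - contamination) ** p
    # Smallest m with 1 - (1 - clean)^m >= probability.
    # Valid for 0 < contamination < 1 and 0 < probability < 1.
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - clean))

# Four parameters, 50% contaminated data, 99% chance of a clean subsample.
m = subsamples_needed(0.99, 0.5, 4)

# Number of all possible subsamples of size 4 from n = 100 observations.
exhaustive = math.comb(100, 4)
```

For these example values a few dozen random subsamples suffice, while the number of all possible subsamples of a 100-pixel neighborhood is in the millions.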



3.1 Detection and Elimination of Outliers 

Outliers in the data set can of course be characterized by their large residual
$r$ as compared to the inliers. This residual of course has to be scaled
according to the rest of the data in order to threshold it in a meaningful way.
One possible way of calculating the scale factor is given in HU. First an
initial scale estimate $s^0$ is computed, according to

$$s^0 = 1.4826 \left(1 + \frac{5}{n - p}\right) \sqrt{\operatorname{median}_i \, r_i^2}. \qquad (7)$$

The median of the squared residuals in equation (7) is the same value by which
the final estimate was chosen, which makes the computation of $s^0$ very efficient.

The preliminary scale estimate $s^0$ is then used to determine a weight $w_i$ for
the $i$th observation, that is

$$w_i = \begin{cases} 1 & \text{if } |r_i / s^0| < 2.5 \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$

By means of these weights a final scale estimate $S$ independent of outliers is
calculated by

$$S = \sqrt{\frac{\sum_{i=1}^{n} w_i r_i^2}{\sum_{i=1}^{n} w_i - p}}. \qquad (9)$$
Fig. 3. a) An image from the IR sequence. b) The number of weights as computed by
LMSOD. Black areas indicate fewer weights, which corresponds to reflexes. c) The flow
field and the linear parameter for TLS and LMSOD inside the box in a. Black regions
indicate where no parameters could be estimated.

The final weights are chosen similarly to equation (8), with the preliminary
scaling factor $s^0$ being replaced by the final scale factor $S$. With these final
weights a weighted total least squares estimate (WTLS) is computed. Since the
weights are binary (either 0 or 1) this is the same as using a TLS solely on
the inliers.
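
The two-stage weighting described in this section can be sketched in Python as follows. The sketch assumes the scale estimates commonly used with least-median-of-squares regression, including a finite-sample factor 1 + 5/(n - p) that is an assumption here, and uses made-up residuals; the function names are hypothetical.

```python
import math
from statistics import median

def initial_scale(residuals, p):
    # Preliminary scale from the median of the squared residuals.
    n = len(residuals)
    med = median([r * r for r in residuals])
    return 1.4826 * (1.0 + 5.0 / (n - p)) * math.sqrt(med)

def binary_weights(residuals, scale, cutoff=2.5):
    # Weight 1 for inliers, 0 for observations with a large scaled residual.
    return [1 if abs(r / scale) < cutoff else 0 for r in residuals]

def final_scale(residuals, weights, p):
    # Scale estimate computed from the inliers only.
    total = sum(w * r * r for w, r in zip(weights, residuals))
    return math.sqrt(total / (sum(weights) - p))

# Made-up residuals: eight inliers and two gross outliers (5.0 and -7.5).
residuals = [0.1, -0.2, 0.15, -0.05, 0.12, -0.18, 0.08, 5.0, -7.5, 0.02]
p = 4  # number of estimated parameters

s0 = initial_scale(residuals, p)
first_weights = binary_weights(residuals, s0)
s_final = final_scale(residuals, first_weights, p)
final_weights = binary_weights(residuals, s_final)
```

In this example both weighting passes mark the same two observations as outliers; the final least squares estimate would then be computed from the remaining eight observations only.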



4 Results 

The field data presented in figures 1 and 3 was taken during the GasExII
experiment in the equatorial Pacific this spring. Therefore only preliminary
analyses have been performed. As can be seen in figure 3, the robust estimator
detects the reflexes and adjusts the weights accordingly. Even reflexes that are
hard to see in the single frame are detected and can be removed. Consequently
the flow field is not affected by them, unlike the one computed from TLS alone,
nor is the total derivative of $\Delta T$ with respect to time. The use of a
robust estimator with its higher computational cost is thus justified by the results.

In order to verify the approach presented in this paper, laboratory
experiments have been performed at the Heidelberg Aeolotron, a dedicated wind wave
facility that exhibits excellent thermal properties. The experimental set-up as
well as some results can be seen in figure 4. In experiments at different wind
speeds, relative humidity as well as air and water temperatures are recorded to
a high accuracy. From these physical quantities the net heat flux can be
calculated 0. There is excellent agreement between the two estimates. Our technique
provides heat flux measurements on spatial and temporal scales that were never
before attainable. Therefore fluctuations that are correlated to wave motions
have been measured for the first time.
Fig. 4. a) The basic experimental set-up for experiments at the Heidelberg Aeolotron.
Wind is generated by paddles and the temperatures of air and water, as well as relative
humidity are recorded, as are wind speed and the velocity of the water. b) Comparison
of heat flux measurements from the proposed technique with those calculated by
conventional means. Fluctuations are not errors, but are correlated to wave motion.



5 Conclusion 

A novel technique for measuring the net heat flux at the sea surface was pre- 
sented. The BCCE was extended to include a term for linear brightness change 
to account for temperature changes in the IR sequences due to heat dissipation. 
This extended BCCE can be solved in a TLS framework, similar to the structure 
tensor in optical flow computations. This technique was made robust by means 
of LMSOD in order to account for reflexes, common in field data. It was shown 
that reflexes can be detected and correct parameters estimated. The validity 
of our technique could also be shown in laboratory experiments. In the future 
interesting phenomena uncovered with this technique for the first time will be 
investigated further. 
Abstract. The detection of approaching vehicles in traffic lanes is an essential processing step for a driver
assistant or a visual traffic monitoring system. For this task a new motion based approach is presented, which
allows processing in real-time without the need for special hardware. Motion estimation is performed along
contours and restricted to the observed lane(s). Due to the lane based computation, vehicles are segmented
by evaluation of the motion direction only. The contour tracking algorithm allows a robust motion estimation
and a temporal stabilisation of the motion based segmentation. The capabilities of our approach are
demonstrated in two applications: an overtake checker for highways and a visual traffic monitoring system.



1 Introduction 

Information about vehicles in traffic lanes is essential in systems for driver assistance or visual traffic 
monitoring. Robustness against changes in illumination, real-time processing and the use of low-cost hardware 
are requirements, which will affect the commercial success of such a vision system. 

Motion is fundamental information for the understanding of traffic sequences. Unfortunately, motion estimation
is computationally expensive. To reach real-time performance without sophisticated and therefore expensive hardware,
motion estimation is only done at significant points ([1],[2],[3]) or constraints about the expected motion are
used ([4]). In [1],[2] and [3] feature points like corners are used as significant points. Motion estimation at
feature points leads to a sparse motion field, which makes it difficult to detect complete objects. Taniguchi ([4])
presents a vehicle detection for the observation of the rear and lateral view. He uses the assumption that passing
cars come out from the focus of expansion (FOE) while the background drains into the FOE. However, this
assumption is only true for straight lanes. Also, a possible offset, caused by vibrations of the observing vehicle, is
not considered. Due to the vehicle dynamics and the unevenness of the roadway these vibrations occur permanently.
This paper presents a new approach for motion estimation in real-time applications. Motion is computed via a
temporal tracking of contours. In contrast to single feature points, which lead to a sparse motion field,
contours are a good compromise between data reduction and motion information. Because visible vehicle
boundaries are described by contours, grouping contours will lead to a complete vehicle segmentation.
Additionally, contours can be tracked by evaluating shape only. This makes motion estimation more robust
against changes in illumination. Motion computation based on local operators suffers from the aperture problem
[10]. This will be reduced by the evaluation of contours.

However, contour extraction produces failures: one contour can be split into parts, or unrelated contour parts
can be connected. In order to overcome this problem, contour tracking is not solved via the direct evaluation of
temporally successive contours. Instead the optimal correspondence between the current contours and the previous
edge image is computed via a correlation based approach.

Motion estimation is only done inside the observed lane(s). This reduces the time for computation and 
additionally simplifies the motion based segmentation. Contour tracking leads to a robust motion estimation and 
allows the temporal stabilization of the motion based segmentation. Chapter 2 explains the motion based vehicle 
detection. In chapter 3 the possible applications of the presented approach are demonstrated in two examples. 
The paper ends with a summary in chapter 4. 



2 Motion Based Vehicle Segmentation 

The main processing steps of the proposed motion based vehicle detection are shown in fig. 1. I(x,y,t) denotes
the current camera image. An example for a rear and lateral observation on highways is given (see image a of fig.
1). For the presented vehicle segmentation knowledge about the observed lane is assumed. In case of a static
observation area this information can be predefined or in the other case the results of a lane detection module are 
used. In the example of fig. 1 a predefined observation area is indicated. The first processing step is a lane based 
transformation, which projects the observation area on a rectangle, denoted as M(u,v,t) . Vehicles are detected 
based on their motion. Motion is estimated along contours, which are tracked over time. Therefore contours have 
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to be extracted from M{u,v,t) . Contours are described by v"(t) , where N stands for the numbers of contours. 
The result of this step is illustrated in image c of fig. 1 . The motion information at the contour points are saved in 
. The results of the temporally stabilized motion segmentation are stored in a binary mask, Mask(u,v,t) 
(see image e in fig. 1). From this mask the corresponding bounding box can easily be extracted. The position of 
the vehicle is then projected back in the camera image (see image a in fig. 1). Details of the single processing 
steps are presented below. 
Fig. 1. The figure illustrates the processing steps of the proposed vehicle detection.



2.1 Lane Based Transformation

The goal of this processing step is to transform the observed lane area from a camera image to a rectangle. The 
principle of this transformation is illustrated in fig. 2. 





Fig. 2. The left part of the figure illustrates the principle of the lane based transformation. The observed area is given by two
parametric functions, $\mathbf{f}_l(t)$ and $\mathbf{f}_r(t)$, which specify the left and right boundary. The right part illustrates two possible
choices for $t = p_v(v)$ (explained below).

The result of the proposed transformation is similar to a bird's-eye view generated by inverse perspective
transformation. In contrast to other methods (e.g. [9]), no camera parameters are used. The transformation uses
only two functions, $\mathbf{f}_l(t)$ and $\mathbf{f}_r(t)$, which specify the left and right boundary of the observed lane.

The transformation should map an observation area in the camera image to a rectangle, which is formulated as
$I(x(u,v), y(u,v)) \rightarrow M(u,v)$. The ranges of the image coordinates are given by $x \in [0,X]$ and $y \in [0,Y]$ for the
camera image and $u \in [0,U]$ and $v \in [0,V]$ for the mapped image. The two boundary functions are given as
two-dimensional polynomial functions, $\mathbf{f}_l(t) = [f_{lx}(t), f_{ly}(t)]^T$ and $\mathbf{f}_r(t) = [f_{rx}(t), f_{ry}(t)]^T$, where $t$ lies in the range
$t \in [0,1]$. So the upper boundary can be described by $\mathbf{f}_{\mathrm{up}}(t') = (\mathbf{f}_r(1) - \mathbf{f}_l(1))\,t' + \mathbf{f}_l(1)$ and the lower boundary by
$\mathbf{f}_{\mathrm{low}}(t') = (\mathbf{f}_r(0) - \mathbf{f}_l(0))\,t' + \mathbf{f}_l(0)$ with $t' \in [0,1]$. Now $x(u,v)$ and $y(u,v)$ result from the substitution $t = p_v(v)$ and
$t' = p_u(u)$:

$$x(u,v) = \big(f_{rx}(p_v(v)) - f_{lx}(p_v(v))\big)\,p_u(u) + f_{lx}(p_v(v)),$$

$$y(u,v) = \big(f_{ry}(p_v(v)) - f_{ly}(p_v(v))\big)\,p_u(u) + f_{ly}(p_v(v)). \tag{1}$$

Values of the functions $p_v(v)$ and $p_u(u)$ must lie in the range $p_v(v), p_u(u) \in [0,1]$. One possible choice could be

$p_v(v) = v/V$ and $p_u(u) = u/U$ (see fig. 2). This leads to a linear sampling of the observation area.

The lane-based transformation offers some advantages: 

1. Representing the observed area in a rectangle is computationally efficient. Storage and processing of
unobserved areas is reduced.

2. Vehicles move along lanes. Therefore motion reduces to a quasi one-dimensional vector.

3. A suitable choice of $p_v(v)$, e.g. $p_v(v) = \sqrt{v/V}$ (see fig. 2), reduces the influence of the camera perspective.
In particular, the differences between motion in near and far image areas can be compensated. This leads to
smaller search areas in the motion estimation process. Additionally the dimension of the mapped image can
be significantly reduced. In the presented examples of chapter 3 a dimension of 120 rows by 80 columns was
sufficient.
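
The lane based transformation can be illustrated with a short Python sketch. It is not the implementation used in this work; it only demonstrates the mapping from a lane area bounded by two parametric functions to a rectangle. The function name `lane_transform`, the nearest-neighbour lookup, the clamping at the image border, and the scaling of the discrete indices by U-1 and V-1 are assumptions made for the example.

```python
import math

def lane_transform(image, f_l, f_r, U, V, p_u=None, p_v=None):
    """Resample the lane area of `image` into a V x U rectangle M[v][u].

    image    : list of rows of gray values, indexed image[y][x]
    f_l, f_r : callables t -> (x, y) for the left and right boundary, t in [0, 1]
    p_u, p_v : sampling functions with values in [0, 1]; linear by default
    """
    # Assumption: the discrete indices 0..U-1 and 0..V-1 are spread over [0, 1].
    if p_u is None:
        p_u = lambda u: u / (U - 1)
    if p_v is None:
        p_v = lambda v: v / (V - 1)
    rows, cols = len(image), len(image[0])
    mapped = [[0] * U for _ in range(V)]
    for v in range(V):
        t = p_v(v)
        (xl, yl), (xr, yr) = f_l(t), f_r(t)
        for u in range(U):
            s = p_u(u)
            # Linear interpolation between the left and right boundary point.
            x = (xr - xl) * s + xl
            y = (yr - yl) * s + yl
            # Nearest-neighbour lookup, clamped to the image (an assumption).
            xi = min(max(int(round(x)), 0), cols - 1)
            yi = min(max(int(round(y)), 0), rows - 1)
            mapped[v][u] = image[yi][xi]
    return mapped

# Toy example: synthetic 8 x 8 image and two straight, vertical boundaries.
image = [[10 * y + x for x in range(8)] for y in range(8)]
f_l = lambda t: (1.0, 1.0 + 4.0 * t)
f_r = lambda t: (5.0, 1.0 + 4.0 * t)
linear = lane_transform(image, f_l, f_r, U=5, V=5)
root = lane_transform(image, f_l, f_r, U=5, V=5,
                      p_v=lambda v: math.sqrt(v / 4))
```

With the square-root sampling, more rows of the rectangle are taken from the part of the lane with large $t$, which is the effect used to compensate the camera perspective.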



2.2 Contour Generation 

The second processing step in fig. 1 is contour generation. This step is necessary because motion is computed by
contour tracking. Contours have to be extracted from the gray-value image, $M(u,v,t)$. Our approach uses the
edge detection method described in [6] and the edge linking method described in [7]. The output of the contour
generation is a contour structure denoted as $\mathbf{v}^N(t)$ (see fig. 1).



2.3 Motion Estimation 

Motion estimation is solved by contour tracking. An extended version of this approach can be found in [5]. The 
block diagram of fig. 3 shows an overview of the necessary processing steps. 




Fig. 3. Block diagram for contour based motion estimation 

The first task is to find correspondences between contours in temporally successive images, which is called
contour matching in fig. 3. The extracted contours at time $t$ are denoted as $\mathbf{v}^N(t) = \{\mathbf{v}_0(k,t), \ldots, \mathbf{v}_{N-1}(k,t)\}$. To
track one contour $\mathbf{v}_i(k,t)$ over time an optimal transformation has to be determined, which transforms a part of
$\mathbf{v}_i(k,t)$ to a corresponding part of $\mathbf{v}_j(l,t-1)$. Because the contours are generated from
images, no knowledge of the complete contours $\mathbf{v}_i(k,t)$ and $\mathbf{v}_j(l,t-1)$ can be assumed. This situation is
demonstrated in fig. 4.
Fig. 4. The left part of the figure illustrates a realistic situation for matching contours, which are extracted from images. The
dotted parts indicate missing contour information. The right part illustrates the principle of the contour matching formulated
in equation (2). The white vector indicates the optimal translation $\mathbf{d}_i$. To show the distance image the minimal distances are
expressed as intensities.



This optimal transformation is computed at every contour point, and it is assumed that it can be modeled by a
simple translation. For every contour point the following task has to be solved: find the best match between the
current contour part surrounding the point, $\mathbf{v}_i(k_i,t)$, and the previous contour points, $\mathbf{v}^N(t-1)$. To measure the
best match, the previous contour points are processed by a distance transformation. This step leads to a distance
image, where every image pixel is set to the minimal distance to a point of $\mathbf{v}^N(t-1)$ (see fig. 4 (right) for an
example of this distance image). The current contour part is then moved over the distance image. For every
translation the distance values along the contour part are summed up. The optimal translation is the translation
leading to the lowest sum of distance values. More formally, for a contour point $\mathbf{v}_i(k_i,t)$ the optimal translation,
$\mathbf{d}_i$, is found by minimizing the following energy:

$$E(\mathbf{d}_i) = \sum_{k=0}^{K-1} D_{\mathbf{v}^N(t-1)}\big(\mathbf{v}_i(k,t) + \mathbf{d}_i\big). \tag{2}$$

$D_{\mathbf{v}^N(t-1)}(u,v,t-1)$ denotes a distance image, which is generated by a distance transformation based on the
extracted contours at time $t-1$, $\mathbf{v}^N(t-1)$, and is expressed as

$$D_{\mathbf{v}^N(t-1)}(u,v,t-1) = \min \big\| [u,v]^T - \mathbf{v}^N(t-1) \big\|. \tag{3}$$

Efficient algorithms (parallel or sequential) of the distance transformation can be found in [8]. Fig. 4 illustrates 
the contour matching formulated in equation (2). 
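
The matching principle of equation (2) can be sketched in a few lines of Python. This is an illustrative example only: the distance image is computed by brute force instead of the efficient algorithms of [8], the exhaustive search over a small window and all names (`distance_image`, `best_translation`, `search`) are assumptions for the example, and translations that move the contour part outside the image are simply rejected.

```python
import math

def distance_image(prev_points, width, height):
    """Brute-force distance transform: D[v][u] is the distance from pixel
    (u, v) to the nearest contour point of the previous image."""
    return [[min(math.hypot(u - pu, v - pv) for pu, pv in prev_points)
             for u in range(width)]
            for v in range(height)]

def best_translation(part, dist, search=3):
    """Return the translation (du, dv) with the lowest sum of distance
    values along the contour part, together with that sum."""
    height, width = len(dist), len(dist[0])
    best, best_cost = (0, 0), float("inf")
    for dv in range(-search, search + 1):
        for du in range(-search, search + 1):
            cost = 0.0
            for u, v in part:
                uu, vv = u + du, v + dv
                if not (0 <= uu < width and 0 <= vv < height):
                    cost = float("inf")  # translation leaves the image
                    break
                cost += dist[vv][uu]
            if cost < best_cost:
                best, best_cost = (du, dv), cost
    return best, best_cost

# Toy example: an L-shaped contour that moved by (+1, +2) between two images.
prev = [(2, 2), (3, 2), (4, 2), (4, 3), (4, 4)]
part = [(u + 1, v + 2) for u, v in prev]
dist = distance_image(prev, width=10, height=10)
d, cost = best_translation(part, dist)
```

The L-shaped contour is chosen on purpose: for a straight line segment several translations along the line would give the same energy, which is the aperture problem mentioned in the introduction.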

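To track contour points over several images, it is necessary to find a temporal predecessor for every point.
Therefore equation (2) has to be modified:

$$E(\mathbf{d}_i) = \begin{cases} \sum_{k=0}^{K-1} D_{\mathbf{v}^N(t-1)}\big(\mathbf{v}_i(k,t) + \mathbf{d}_i\big) & \text{if } D_{\mathbf{v}^N(t-1)}\big(\mathbf{v}_i(k_i,t) + \mathbf{d}_i\big) = 0 \\ \text{MAX\_VAL} & \text{else} \end{cases} \tag{4}$$

This formulation enforces that only translations are allowed which move the contour point $\mathbf{v}_i(k_i,t)$ exactly onto
a point of $\mathbf{v}^N(t-1)$.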

The presented contour matching allows the tracking of contour points over several images. This is used to
integrate the motion vectors over time, which leads to a temporally stabilized motion estimation. The stabilized
motion vectors at every contour point are denoted as $\mathbf{m}^N(t)$.

For images of 184 columns by 128 rows the presented motion estimation, implemented on a
Pentium III (600 MHz), runs in real-time (< 40 msec).



2.4 Temporal Stabilization of Motion Segmentation 

In this processing step vehicles are detected based on their motion. The contour tracking, developed in the 
previous section, allows a temporal stabilization of the motion based segmentation. Because of the temporal 
correspondence of contour points it is also possible to track how often a contour point was selected before. This 
information is stored in an image called Age(u,v,t) . 

Temporal stabilization needs the current contour structure, $\mathbf{v}^N(t)$, the corresponding motion vectors, $\mathbf{m}^N(t)$, the
explicit assignment of the predecessors, $\mathbf{d}^N(t)$, and $Age(u,v,t-1)$ as input. The algorithm consists of the
following steps:

1. Select all contour points with a negative motion component in v direction. This simple motion segmentation is
possible because of the lane based transformation. These selected points are treated as potential contour points
of a vehicle. This is illustrated in fig. 1 d. The selected points are denoted as $\mathbf{p}^M(t)$, where $M$ specifies the number of
selected points. Additionally, the explicit assignments of the predecessors at the selected points are denoted as
$\mathbf{d}^M(t)$.

2. If a selected point, $\mathbf{p}_i$, was also detected at time $t-1$, increment the age of the point by 1, else initialize the
age with 1. Formally, this can be expressed as follows: first, at every pixel $Age(u,v,t)$ is set to 0; second, at
every selected point the following iteration is done:

$$Age(p_{u,i},\, p_{v,i},\, t) = Age(p_{u,i} - d_{u,i},\; p_{v,i} - d_{v,i},\; t-1) + 1. \tag{5}$$

3. All selected points which are older than a given threshold are treated as true contour points of a vehicle.
These true points are marked in the mask image, $Mask(u,v,t)$:

$$Mask(u,v,t) = \begin{cases} 1 & \text{if } Age(u,v,t) > \text{threshold} \\ 0 & \text{else} \end{cases} \tag{6}$$

$Mask(u,v,t)$ is then dilated to generate closed vehicle masks. The result of this processing step is illustrated
in fig. 1 e.

4. Vehicles are finally detected in $Mask(u,v,t)$. This is achieved by finding the corresponding bounding box.
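
The four steps above can be sketched as follows. The sketch is not the original “C” implementation. To stay independent of the sign convention of the predecessor assignment, each selected point is given together with the absolute position of its predecessor at time t-1 (or `None` if it has none); this interface, the function names, the square dilation window, and the threshold parameter `min_age` are assumptions made for the example.

```python
def select_indices(motion):
    """Step 1: indices of contour points with a negative v motion component."""
    return [i for i, (mu, mv) in enumerate(motion) if mv < 0]

def update_age(selected, predecessors, age_prev, width, height):
    """Step 2: Age is reset to 0 everywhere; at each selected point it becomes
    the age at the predecessor position in the previous image plus 1."""
    age = [[0] * width for _ in range(height)]
    for (u, v), pred in zip(selected, predecessors):
        prev_age = age_prev[pred[1]][pred[0]] if pred is not None else 0
        age[v][u] = prev_age + 1
    return age

def age_mask(age, min_age):
    """Step 3: mark all points that are older than the threshold."""
    return [[1 if a > min_age else 0 for a in row] for row in age]

def dilate(mask, radius=1):
    """Binary dilation with a square (2*radius+1) x (2*radius+1) window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            if mask[v][u]:
                for vv in range(max(0, v - radius), min(h, v + radius + 1)):
                    for uu in range(max(0, u - radius), min(w, u + radius + 1)):
                        out[vv][uu] = 1
    return out

def bounding_box(mask):
    """Step 4: (u_min, v_min, u_max, v_max) of all set pixels, or None."""
    pts = [(u, v) for v, row in enumerate(mask) for u, m in enumerate(row) if m]
    if not pts:
        return None
    us, vs = [p[0] for p in pts], [p[1] for p in pts]
    return min(us), min(vs), max(us), max(vs)

# Toy example on a 6 x 6 grid: one point moves upwards (negative v) in two images.
W = H = 6
age0 = [[0] * W for _ in range(H)]
idx = select_indices([(0, -1), (0, 1)])             # only point 0 is selected
age1 = update_age([(2, 4)], [(2, 5)], age0, W, H)   # first detection
age2 = update_age([(2, 3)], [(2, 4)], age1, W, H)   # tracked from (2, 4)
mask = dilate(age_mask(age2, min_age=1))
box = bounding_box(mask)
```

A point that appears only once never exceeds the age threshold, so isolated errors of the motion segmentation do not reach the mask.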



3 Applications 

In this chapter two applications of the proposed vehicle detection are presented. The first application was already
illustrated in fig. 1. The observation of the rear and lateral lane on highways is necessary, for example, for an
overtake checker. The motion based vehicle detection presented in the previous chapter can be used directly for
this application. The left part of fig. 5 shows the results of the vehicle detection for the overtake checker. The left
column contains the detected vehicles in the camera image. The right column shows the corresponding
transformed images: the transformed camera image, the extracted contours, the selected contour points and the
temporally stabilized motion masks. Additionally the bounding boxes, extracted from the mask image, are
indicated. The dimension of the transformed images is 120 rows by 80 columns. On a Pentium III (600 MHz)
the “C”-implementation of the proposed vehicle detection takes about 10 msec. The exact processing time
depends on the number of contour points.

The second application is a visual traffic monitoring system. Results of the motion based segmentation are 
shown in the right part of fig. 5. At six successive instants the transformed image, the extracted contours with 
motion vectors and the generated motion masks are displayed. The time index is written in the left upper corner 
of the transformed camera image. Index 300 shows the situation of the beginning of a red period (the first 
vehicle is already waiting), index 400 during a red period, and index 900 at the beginning of a new green period. 
Due to the overlap of the vehicles in the upper lane area, moving vehicles are detected as one group. Single
vehicles are only detected as they approach the camera.

Fig. 6 demonstrates that the proposed vehicle segmentation is robust under different lighting conditions and
independent of the vehicle type.



4 Summary 

This paper presents a new approach for the detection of vehicles. Segmentation is based on motion only. The
limitation of the processing to the observed lane(s) and a new concept for contour tracking allow processing in
real-time. On a Pentium III processor (600 MHz) a pure “C”-implementation requires 10 msec. The dimension
of the transformed images is set to 120 rows by 80 columns.

Contour tracking, explained in chapter 2, leads to a robust motion estimation. Additionally, it allows the
temporal stabilization of the motion segmentation. Vehicles are detected reliably and errors of motion
segmentation are rejected. The proposed approach also allows contour tracking over a long time period. It is
possible to track single contours from entering to leaving the observation area. The transit time can
be used to measure the effective velocity or waiting time at a traffic light.

The presented contour tracking allows the use of motion as basic information in real-time applications. It is 
possible to combine motion information with other approaches, e.g. stereo processing or model based 
approaches, for a further improvement of robustness. 
Fig. 5. Results of the motion based vehicle detection for the overtake checker (left) and visual traffic monitoring (right).




Fig. 6. Vehicle segmentation is robust under sunny conditions (left) and independent of vehicle types (right).
Abstract. Mosaic images provide an efficient representation of image
sequences and simplify scene exploration and analysis. However, the ap- 
plication of conventional methods to generate mosaics of scenes with 
moving objects causes integration errors and a loss of dynamic infor- 
mation. In this paper a method to compute mosaics of dynamic scenes 
is presented addressing the above mentioned problems. Moving pixels 
are detected in the images and not integrated in the mosaic yielding a 
consistent representation of the static scene background. Furthermore, 
dynamic object information is extracted by tracking moving regions. To 
account for unavoidable variances in region segmentation topologically 
neighboring regions are grouped into sets before tracking. The regions’ 
and objects’ motion characteristics are described by trajectories. Along 
with the background mosaic they provide a complete representation of 
the underlying scene which is ideally suited for further analysis.



1 Introduction 

An important topic in computer based scene exploration is the analysis of image 
sequences, since motion within the scene cannot be extracted using single im- 
ages only. However, the resulting amount of data to be processed usually limits 
the application area of image sequences. One possibility to reduce the amount 
of data is to create mosaic images. In doing so a sequence is integrated in one 
single mosaic image, thus removing redundancies within the sequence. Applying
conventional methods of mosaic generation to sequences with moving objects
yields integration errors and a loss of dynamic information. In the works of
Megret and Davis, moving areas within a sequence are therefore detected and
omitted from integration. Thus integration errors can be avoided, but dynamic
information is still lost. Cohen [2] suggests tracking of moving regions based
on dynamic templates; however, if the shapes of objects change significantly,
templates are not sufficient. In the work of Irani, tracking is realized by
temporal integration, but no explicit data extraction is suggested.

* This work has been supported by the German Research Foundation (DFG) within 
SFB 360. 

B. Radig and S. Florczyk (Eds.): DAGM 2001, LNCS 2191, pp. 20S-|2^ 2001. 
The method presented in this paper is based on a two-step strategy for each
image to process resulting in a mosaic of the static scene background and tra- 
jectories describing the object motion. In the first step pixels belonging to pro- 
jections of moving objects, in the following referred to as moving pixels, are 
detected resulting in a motion map and omitted during subsequent integration 
of greylevel information to generate the background mosaic. In the second step 
regions, referred to as moving regions, are first extracted from the motion maps. 
They are subsequently grouped into connected components and matched against 
the components of the previous image. Thus temporal correspondences can be 
established and trajectories are derived. 

The paper is organized as follows. In section 2 a brief introduction to image
alignment and detection of moving pixels is given. Section 3 outlines the temporal
correspondence analysis based on moving regions by which dynamic scene
information is extracted. Results of applying the algorithms to various image
sequences are presented in section 4 and finally a conclusion is given.

2 Motion Detection 

To detect moving objects in an image sequence many methods have been proposed
in the literature. Most algorithms rely on the analysis of pixel based intensity
differences between the images after an alignment step in case of a non-static
camera. In our approach the images are aligned using perspective flow, where
the current background mosaic serves as reference. The alignment is based on
an underlying motion model describing the global motion between both images
induced by the active camera. We chose a projective transformation, which
yields a correct global transformation e.g. for camera rotation around the
optical center (no translation) and arbitrary static scenes, while projections
of moving objects result in violations of this model. These errors are
subsequently detected by computing either the average intensity difference or
the mean magnitude of the local normal flow $n(x,y)$ for each pixel $(x,y)$
within a neighborhood $\mathcal{N}$, as illustrated in equation (1) for the mean
normal flow $N(x,y)$. Taking neighboring pixels into account yields more robust
classification results as the influence of image noise is reduced.

$$N(x,y) = \frac{1}{|\mathcal{N}|} \sum_{(x',y') \in \mathcal{N}} \big\| n(x',y') \big\| \tag{1}$$

The classification of moving pixels itself is accomplished by thresholding the
resulting pixel wise motion measure. Thus detection of motion is achieved except
for image regions where motion does not cause any changes in intensity. However,
the resulting motion maps are often fragmented and are therefore smoothed by
applying a dilation operator of size 7 x 7 to the thresholded motion maps. Hence
moving areas become more compact and small gaps in between are closed.
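
The classification of moving pixels described above can be illustrated by the following Python sketch. It is a simplified example under stated assumptions, not the authors' implementation: the normal flow is assumed to be available as one scalar value per pixel, the neighborhood is a square window clipped at the image border, and the names `mean_magnitude`, `dilate` and `motion_map` are hypothetical.

```python
def mean_magnitude(flow, radius=1):
    """Average the flow magnitude over a (2*radius+1)^2 window around each
    pixel; the window is clipped at the image border."""
    h, w = len(flow), len(flow[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [abs(flow[yy][xx])
                      for yy in range(max(0, y - radius), min(h, y + radius + 1))
                      for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(window) / len(window)
    return out

def dilate(mask, radius):
    """Binary dilation with a square (2*radius+1) x (2*radius+1) window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out

def motion_map(flow, threshold, radius=1, dilation_radius=3):
    """Threshold the averaged motion measure, then dilate the result.
    dilation_radius=3 corresponds to a 7 x 7 operator."""
    measure = mean_magnitude(flow, radius)
    mask = [[1 if m > threshold else 0 for m in row] for row in measure]
    return dilate(mask, dilation_radius)

# Toy example: a single strong flow response in the centre of an 11 x 11 image.
flow = [[0.0] * 11 for _ in range(11)]
flow[5][5] = 9.0
moving = motion_map(flow, threshold=0.5)
```

In this example the thresholded 3 x 3 block around the response grows to a 9 x 9 block after dilation, which shows why the region expansion has to be considered when region sizes are compared later.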
The resulting motion information is used to integrate the current image into
the mosaic where only static pixels are taken into account yielding a consistent 
representation of the static scene background. Further temporal correspondence 
analysis is based on this data as presented in the next section. 

3 Temporal Correspondence Analysis 

The resulting motion information is sufficient to generate mosaics of the static 
scene background. As a next step we aim at explicitly representing the dynamic 
information of the image sequence contained in the already calculated motion 
maps. Therefore in our approach moving regions resulting from region labelling 
the thresholded and dilated motion maps are tracked and trajectories describing 
their motions are generated to represent the dynamic information. 



3.1 Tracking of Moving Regions 

Tracking is based on moving regions. The matching criterion for tracking will
now be developed for these regions, but subsequently applied to sets of regions
(section 3.2).

Each moving region needs to be characterized by several features which allow
us to match corresponding regions of consecutive images. When selecting
appropriate features it has to be taken into account that, due to variances
within the segmentation process and because of scene events like object
decompositions or collisions, the moving regions’ shape and size may vary
significantly even for consecutive images. Furthermore regions often do not
show homogeneous intensity values since generally moving regions contain
projections from different objects or surfaces. Of course, regions resulting
from a single object may also have inhomogeneous gray values. Due to these
limitations features need to be chosen which are largely invariant to scaling
and changes in shape and which preserve an adequate description of the regions’
pixel values. This is true for the histogram of the pixel values and the
centroid. Based on these features robust detection of correspondences between
regions of consecutive images is possible: two regions are considered as
corresponding if the distance between their centroids is smaller than a given
threshold $\theta_d$, usually set to 60 pixels, and if the distributions of their
intensity values are similar. This is checked by computing the overlapping area
$F$ of the two normalized histograms, where the minima of the entries $a_i$ and
$b_i$ in all cells $i$ are summed up:

$$F = \sum_i \min(a_i, b_i) \tag{2}$$

The resulting intersection area $F$ is required to exceed a given percentage
$\theta_p$ for establishing a match. $\theta_p$ is usually chosen between 0.75 and
0.85. For robust region tracking the sizes $A$ and $B$ of both regions, given by
the number of pixels, are compared in addition. However, a match is rejected
only if large differences between $A$ and $B$ occur, because the region
expansion induced by the dilation operator has to be taken into account
explicitly. The difference in size between two areas is regarded as too large
if the following condition holds:

$$\frac{|A - B|}{A + B} > 0.2 \tag{3}$$
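
The complete match criterion (centroid distance, histogram intersection and size comparison) can be summarized in a short Python sketch. The dictionary layout of a region, the function names, and the assumption that the region size equals the sum of the unnormalized histogram entries are choices made for this example only; the default thresholds follow the values given in the text.

```python
import math

def normalise(hist):
    """Scale histogram entries so that they sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def histogram_intersection(a, b):
    """Overlapping area F of two normalized histograms."""
    return sum(min(x, y) for x, y in zip(a, b))

def regions_match(r1, r2, theta_d=60.0, theta_p=0.8, max_size_diff=0.2):
    """Return True if two regions are considered as corresponding."""
    (x1, y1), (x2, y2) = r1["centroid"], r2["centroid"]
    if math.hypot(x1 - x2, y1 - y2) >= theta_d:
        return False                      # centroids too far apart
    # Assumption: the histogram counts every pixel of the region exactly once.
    size_a, size_b = sum(r1["hist"]), sum(r2["hist"])
    if abs(size_a - size_b) / (size_a + size_b) > max_size_diff:
        return False                      # difference in size too large
    f = histogram_intersection(normalise(r1["hist"]), normalise(r2["hist"]))
    return f > theta_p                    # intensity distributions similar

# Toy example with three-bin histograms.
r1 = {"centroid": (100.0, 50.0), "hist": [10, 20, 10]}
r2 = {"centroid": (103.0, 54.0), "hist": [12, 18, 10]}   # similar region
r3 = {"centroid": (103.0, 54.0), "hist": [40, 0, 0]}     # different intensities
r4 = {"centroid": (103.0, 54.0), "hist": [25, 50, 25]}   # much larger region
r5 = {"centroid": (200.0, 50.0), "hist": [10, 20, 10]}   # too far away
f12 = histogram_intersection(normalise(r1["hist"]), normalise(r2["hist"]))
```

Only `r2` is accepted as a match for `r1`; the other three regions each fail one of the three tests.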



3.2 Tracking Components 

Matching all pairs of moving regions from two consecutive images is not very
efficient due to the combinatorics. Additionally regions frequently decompose
or merge in the course of the image sequence, induced by variances of the
segmentation results (see e.g. motion maps in fig. 2) or events within the
scene. In such cases correspondences cannot be established due to significant
differences in the regions’ histograms or size or too large distances between
their centroids. Therefore we propose to track connected components instead,
which are referred to as components subsequently, similar to the approach for
color segmentation in [2]. There, regions are considered as neighboring if they
are spatially adjacent and similar with respect to their color features. In our
case regions are assumed to be neighboring if their minimum point distance is
smaller than a given threshold. As mentioned, matching of connected components
is based on the same criteria as developed to match moving regions (section
3.1), where the features can be derived directly from the features of the
underlying regions. Searching for correspondences between these components
reduces complexity, and variances within the segmentation can be handled since
it is not required that components contain the same number of regions. Rather
each component is treated as a single region. Using this strategy objects or
parts of objects can be tracked robustly in most cases. However, in some cases
correspondences cannot be established due to a changing arrangement of the
regions within the components during the tracking process. To cope with these
situations, for each unmatched component all subsets of constituting moving
regions are considered in a final step. Between all subsets of all components
from consecutive images the best match is searched iteratively, where the same
match criteria as for components are applied. Matched subsets are removed from
the sets and the search continues until no more matches can be established or
all given subsets have been matched.
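
The grouping of moving regions into components can be sketched as follows. The example uses a union-find structure and a brute-force minimum point distance; both are choices made for this illustration and are not taken from the paper, as are the names `min_point_distance`, `group_regions` and `max_gap`. The final subset matching step is not covered by the sketch.

```python
import math

def min_point_distance(region_a, region_b):
    """Smallest Euclidean distance between any two pixels of the two regions."""
    return min(math.hypot(xa - xb, ya - yb)
               for xa, ya in region_a for xb, yb in region_b)

def group_regions(regions, max_gap):
    """Group regions into components: two regions are neighboring if their
    minimum point distance is smaller than max_gap; components are the
    connected sets of neighboring regions."""
    n = len(regions)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if min_point_distance(regions[i], regions[j]) < max_gap:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy example: regions 0 and 1 are close to each other, region 2 is isolated.
regions = [[(0, 0), (1, 0)], [(3, 0)], [(20, 20)]]
components = group_regions(regions, max_gap=3)
```

Each component is returned as a list of region indices, so its histogram, centroid and size can be accumulated from the features of the member regions.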

3.3 Trajectories 

As a result of tracking, correspondences between components of consecutive
images have been established. To extract the dynamic information, in the present
context given by the motion direction, trajectories for all tracked components
are generated. The position of each component in an image is given by its
centroid. A concatenation of the centroids of matched components, and matched
subcomponents as well, yields a trajectory describing the objects’ motions within
the image sequence. In the case of translational motion this is sufficient to
characterize the object motion. Future work will focus on a more detailed analysis
of the trajectories, which may serve as a starting point to apply more
sophisticated motion models. Furthermore they can be used to detect discontinuities
of object motion and components incorrectly classified as moving. In the latter
case, trajectory points of these components show little variance, which should
make it possible to distinguish them from moving ones (see figure 4). Finally,
based on the trajectories a reconstruction of homogeneous objects which cannot
be detected as a whole is possible. Different moving components belonging to
one object show similar trajectories and could be grouped to reconstruct the
object.
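
The idea of separating moving components from components incorrectly classified as moving by the variance of their trajectory points can be sketched in Python. The paper names this only as a direction for future work, so the variance measure, the threshold `max_variance` and all function names below are assumptions made for illustration.

```python
def centroid(points):
    """Centroid of a list of (x, y) pixel coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def trajectory_variance(trajectory):
    """Mean squared distance of the trajectory points to their mean position."""
    mx, my = centroid(trajectory)
    return sum((x - mx) ** 2 + (y - my) ** 2 for x, y in trajectory) / len(trajectory)

def is_static(trajectory, max_variance=1.0):
    """A component whose centroid hardly moves is probably a false alarm,
    e.g. the uncovered initial position of an object."""
    return trajectory_variance(trajectory) < max_variance

# Toy example: one translating component and one nearly constant centroid.
moving = [centroid([(0, 0), (2, 0)]),
          centroid([(10, 0), (12, 0)]),
          centroid([(20, 0), (22, 0)])]
static = [(5.0, 5.0), (5.0, 6.0), (5.0, 5.0)]
```

Here `moving` is the concatenation of three component centroids, as described above, while `static` imitates the nearly constant centroid of a region that is detected at an object's initial position.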



4 Results 

The proposed algorithms for detection and tracking of moving objects have been 
tested on various image sequences. All sequences were acquired using an active
camera which scanned the scene by pan and/or tilt movements. For the first three 
examples presented in this article the mosaics of the static scene background are 
shown along with the reconstructed trajectories. On the left side of each figure 
several images of the underlying sequence are depicted while in the lower half 
motion maps for different points of time within the sequence are shown. The last 
example illustrates the possibilities for false alarm detection and elimination of 
initial object positions (see below) based on trajectory analysis. 




Fig. 1. Results of tracking a person: although the shape of the regions changes (maps 
below the mosaic) the motion trajectory (black line) can be reconstructed correctly 



In figure 1 the person’s motion is reconstructed correctly (using intensity
difference for detection) despite the fact that the shape of the detected regions
and their sizes vary significantly over time (as can be seen in the motion maps).
The position of the person can be reconstructed almost completely and no parts
of the person are integrated, so that a consistent representation of the static
scene background results. However, it needs to be mentioned that the initial
position of the person remains in the mosaic. It is integrated at the beginning
of the sequence because it initially belongs to the static background. After the
person has left the initial position, the previously occluded background becomes
visible and results in large intensity differences at this image region, which
consequently is detected as a “moving” region. Therefore at this position new
intensity information can never be integrated. However, the centroid of this
virtually moving region is nearly constant and the analysis of trajectory data
is a promising starting point to correct this error in the future.

The second example (fig. 2) shows a mosaic image of a scene containing
several pens. A pencil enters the scene from the top left corner and stops moving
after hitting one of the other pens in the scene. Its trajectory (reconstructed
using normal flow detection) is drawn black whereas the white arrow indicates
the real motion direction. Especially at the beginning of the tracking process
the motion given by the trajectory differs significantly from the true one. This
originates from the fact that initially only parts of the pencil are visible. As new
parts become visible a displacement of the centroid results and the reconstructed
translation direction is distorted accordingly. As soon as the pencil is completely
visible these divergences disappear and the centroids’ displacements are caused
by real object motions only, allowing a correct reconstruction of direction.




Fig. 2. Mosaic and reconstructed trajectory (black) of an object within a desk scenario. 
The motion maps show great variances (centroids of moving regions marked grey). 



Figure 3 illustrates the mosaic and extracted trajectories of another desk
scenario. The matchbox in the upper half of the four images on the left is opened
and its two parts are moving in opposite directions afterwards. As in the first
example, the initial box position remains part of the static mosaic whereas the
following positions are omitted within the integration process. However, the
object parts cannot be detected completely by computing the intensity difference
due to lack of contrast between the image background and the object parts.
Especially the box top moving to the left causes integration errors due to
incomplete object detection. Still the reconstructed trajectories describe the
object motion almost correctly. Even the decomposition of the box (which forces
the moving regions to be split up multiple times, see motion maps in figure 3) is
identified correctly. Due to the fact that the regions resulting from the
decomposition are grouped into one component they can be matched to the single
region of the previous image. With increasing distance, however, they are
eventually arranged into different components (points of time indicated by white
circles within the figure). This causes a significant change within their
centroids’ positions (white arrows). However, the existing correspondences can be
detected correctly and the scene events represented exactly.




Fig. 3. Detection and tracking of an object decomposing in two parts 



Finally, the last example in figure 4 illustrates the previously mentioned
possibility of detecting falsely classified regions by trajectory analysis. The images
of the sequence and the related mosaic depict several objects of the construction
scenario of the SFB 360 where the dark ring in the center of the images moves
to the right. Its initial position remains part of the mosaic. However, the plot of
trajectory points at the bottom shows that the variance within these points is quite
small and should be sufficient to identify this region as falsely classified. Hence the
mosaic image can be corrected later on by integrating local information from the
current image although the region had been classified as moving beforehand.



5 Conclusion 

In this paper an approach to generate mosaics of image sequences containing
moving objects has been presented. A mosaic of the static scene background
is generated and in parallel trajectories representing the dynamic information
are extracted. To this end moving regions within the sequence are segmented
based on pixel wise normal flow or intensity difference between two images.
Subsequently the regions are grouped into sets of topologically neighboring
regions, on which tracking is based. In this manner variances within the
segmentation process and object decompositions can be handled. The components
and, if necessary, subsets of them are robustly tracked over time by comparing
their intensity histograms and centroid positions. Future work will focus on
removing initial object positions from the mosaics and on detecting false alarms.
As pointed out, trajectory data as computed yield an ideal starting point to
solve these problems.

Fig. 4. The static initial position of the moving ring remains part of the mosaic.
However, analysing the trajectory points indicates low variance suitable for identification.
Abstract. We present an object recognition system which is capable
of on-line learning of representations of scenes and objects from natu- 
ral image sequences. Local appearance features are used in a tracking 
framework to find ‘key-frames’ of the input sequence during learning. 
In addition, the same basic framework is used for both learning and 
recognition. The system creates sparse representations and shows good 
recognition performance in a variety of viewing conditions for a database 
of natural image sequences. 

Keywords: object recognition, model acquisition, appearance-based 
learning 



1 Introduction 



Many computer vision recognition systems typically followed Marr’s approach to 
vision in building three-dimensional (3D) representations of objects and scenes. 
The appearance-based or view-based approach, however, has recently gained 
much momentum due to its conceptual simplicity and strong support from stud- 
ies on human perception. In this approach, an object is represented by viewer- 
centered ‘snapshots’ instead of an object-centered 3D model. 

In recent years appearance-based vision systems based on local image descrip- 
tors have demonstrated impressive recognition performance ( [ I I4l2l8j 1 . These 
systems normally work on a pre-defined database of objects (in the case of |H| 
more than 1000 images). However, one problem these approaches have not yet 
addressed is how to acquire such a database. When considering an active agent 
which has to learn and later recognize objects, the visual input the agent re- 
ceives consists of a sequence of images. The temporal properties of the visual 
input thus represent another source of information the agent can exploit ( 0 ). 

In this work, we therefore want to go one step further and move from a 
database of static images to image sequences. We present a recognition system 
which is capable of building on-line appearance-based scene representations for 
recognition from real world image sequences. These sequences are processed by 
the system to find ‘key-frames’ - frames where the visual change in the scene is 
high (related to the idea of ‘aspect-graphs’ from [3]). These key-frames are then
characterized by local image descriptors on multiple scales which are used in the 
learning and recognition stages. The system uses the same framework for learning 
and for recognition which leads to an efficient and simple implementation. It was 
tested in a variety of different real world viewing situations and demonstrated 
very good recognition performance. 
Fig. 1. Overview of the learning stage.



2 Overview 

The system consists of two parts which share the same architecture: the learning 
part and the recognition part. In the learning part (see Fig. 1), the input consists
of an image sequence, which is processed on-line. First, each incoming frame 
is embedded in a Gaussian pyramid to provide a multi-scale representation. In 
the first frame features are extracted, which are then tracked in the subsequent 
frames. Once tracking fails, a new key-frame is found and a new set of features 
is extracted in this key-frame, and the whole process repeats until the sequence 
ends. The model of the sequence then consists of a number of key-frames, which 
contain visual features on multiple scales. The second part is the recognition 
stage: Here, the test image (taken under different viewing conditions) is com- 
pared to the already learned models by using the same procedural steps. Local 
features are extracted and matched against the key-frames in all learned models. 
In the next sections, we describe the two parts of the system in more detail. 

3 The Learning Stage 

3.1 Feature Extraction 

We decided to use corners as visual features, since these were found to be good 
and robust features under many viewing conditions in numerous other works. 
In order to extract corners we modified a standard algorithm m to integrate 
information about all three color channels since this further improved robustness 
in the color image sequences we processed. Corners are found by inspecting the 
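structure in a 9x9 neighborhood $\mathcal{N}$ of each pixel in the following way:

$$H = \begin{pmatrix} \sum_{\mathcal{N}} \langle \frac{\partial I}{\partial x}, \frac{\partial I}{\partial x} \rangle & \sum_{\mathcal{N}} \langle \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \rangle \\ \sum_{\mathcal{N}} \langle \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \rangle & \sum_{\mathcal{N}} \langle \frac{\partial I}{\partial y}, \frac{\partial I}{\partial y} \rangle \end{pmatrix}$$

with < , > as dot-product and I as the vector of RGB-values such that an element
of H is e.g. $H(1,2) = \sum_{\mathcal{N}} \left( \frac{\partial R}{\partial x}\frac{\partial R}{\partial y} + \frac{\partial G}{\partial x}\frac{\partial G}{\partial y} + \frac{\partial B}{\partial x}\frac{\partial B}{\partial y} \right)$. The smaller of the two
eigenvalues $\lambda_2$ of H yields information about the structure of the neighborhood.
We use a hierarchical clustering algorithm to cluster the values of $\lambda_2$ into two
sets and use the one with the higher mean value as the feature set. Using a
clustering algorithm has the advantage that one does not need to specify a
hard-coded threshold for the values of $\lambda_2$ for each image.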
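
As a minimal sketch of how the smaller eigenvalue can be obtained for one neighborhood, the following Python fragment accumulates the color structure matrix from given RGB derivatives and evaluates the closed-form eigenvalue of the symmetric 2x2 matrix. The function name and the input layout are illustrative assumptions; the derivative filters and the subsequent clustering step are not shown.

```python
import math


def smaller_eigenvalue(grads):
    """Smaller eigenvalue of the 2x2 color structure matrix H.

    grads: list of (dx, dy) pairs, one per pixel of the neighborhood,
    where dx and dy are 3-tuples with the R, G, B derivatives
    (hypothetical input layout chosen for this sketch).
    """
    hxx = hxy = hyy = 0.0
    for dx, dy in grads:
        hxx += sum(a * a for a in dx)               # <dI/dx, dI/dx>
        hxy += sum(a * b for a, b in zip(dx, dy))   # <dI/dx, dI/dy>
        hyy += sum(b * b for b in dy)               # <dI/dy, dI/dy>
    mean = 0.5 * (hxx + hyy)
    diff = 0.5 * (hxx - hyy)
    # Eigenvalues of [[hxx, hxy], [hxy, hyy]] are mean +/- sqrt(diff^2 + hxy^2).
    return mean - math.sqrt(diff * diff + hxy * hxy)
```

For a pure edge (all gradients parallel) the value is zero, while a corner-like neighborhood with gradients in both directions gives a clearly positive value, which is why the cluster with the higher mean is taken as the feature set.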

The whole procedure results in about 200 features over all three scales. 

3.2 Tracking Visual Features 

After extraction of feature positions, these features are then tracked over the
image sequence. For this purpose, we look for corners in the image at
time t + 1 in the vicinity of corners found at time t. The actual matching of
the corners is done with an algorithm that is based on [7] (where it was used
for stereo matching) and pioneered by |S|, which we briefly summarize as it
provides an elegant solution to feature matching and is also used later on in the
recognition stage.

The algorithm constructs a matrix A with each entry A(i,j) defined by

$$A(i,j) = e^{-\frac{1}{\sigma_{dist}^2}\,\mathrm{dist}(f_{i,t+1},\,f_{j,t})} \cdot e^{-\frac{1}{\sigma_{NCC}^2}\,(1-\mathrm{NCC}(f_{i,t+1},\,f_{j,t}))}$$

where $f_{i,t+1}$ is the position of a corner at time t + 1 and $f_{j,t}$ that of another corner at
time t in the previous image, with i,j indexing all feature pairs in the two images.

The first term measures the distance (dist) from feature i to feature j with
$\sigma_{dist}$ set to small values (in our implementation $\sigma_{dist}$ = 8, 16, 32 pixels for each
of the three scales of the Gaussian pyramid), thus giving a tendency toward close
matches in distance. The second term measures the normalized cross-correlation
(NCC) of the neighborhoods from features i and j, with $\sigma_{NCC}$ set to a value
of 0.4, thus biasing the results toward features that are similar in appearance.
Based on the Singular Value Decomposition of A ($A = U \cdot V \cdot W^T$), the matching
algorithm uses the modified SVD of this matrix, defined by $A' = U \cdot I \cdot W^T$,
where I is the identity matrix. Features are matched if they have both the
highest entry in the column and row of A' and if the NCC exceeds a given
threshold (in our case $th_{NCC}$ > 0.6). This method effectively provides a
least-squares mapping and at the same time ensures that there is a one-to-one mapping
of features¹.
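
The selection rule in the last step can be sketched as follows. The fragment assumes that the modified matrix A' and the pairwise NCC values have already been computed and are given as nested lists; the function name and argument layout are hypothetical, and the SVD itself is not part of this sketch.

```python
def mutual_best_matches(a_prime, ncc, ncc_threshold=0.6):
    """Return index pairs (i, j) that dominate both their row and column.

    a_prime[i][j]: entry of the modified matrix A' (assumed precomputed).
    ncc[i][j]:     normalized cross-correlation of features i and j.
    """
    n_rows = len(a_prime)
    n_cols = len(a_prime[0]) if n_rows else 0
    matches = []
    for i in range(n_rows):
        # Column with the largest entry in row i.
        j = max(range(n_cols), key=lambda c: a_prime[i][c])
        # Row with the largest entry in that column.
        best_row = max(range(n_rows), key=lambda r: a_prime[r][j])
        if best_row == i and ncc[i][j] > ncc_threshold:
            matches.append((i, j))
    return matches
```

Because a pair is accepted only when it is the maximum of both its row and its column, no feature can take part in more than one match, which gives the one-to-one property mentioned above.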

The resulting algorithm is capable of tracking under affine feature transfor- 
mations between two images ensuring reliable and flexible tracking even for larger 
feature displacements between frames. Tracking could thus even be implemented 
in real-time since for normal camera and object velocities the tracker does not 
need to run at full framerate as it is able to cope with larger displacements. 

¹ Note that the feature mapping can occur between sub-sets of features.
The tracking procedure is followed until more than 75 percent of the initial
features are lost at any one scale. This type of ‘tracking through survival’ effec- 
tively excludes false matches since these tend not to be stable across the whole 
image sequence. Once tracking fails, a new key-frame is inserted, new features 
are extracted and the process repeats until the sequence ends. 
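
The key-frame criterion can be written down in a few lines. This is only a sketch: the function name and the per-scale count lists are assumptions made for illustration, while the 75 percent limit is the value stated in the text.

```python
def needs_new_keyframe(initial_counts, surviving_counts, max_lost_fraction=0.75):
    """True if more than max_lost_fraction of the initial features
    has been lost at any one scale of the pyramid.

    initial_counts / surviving_counts: number of features per scale
    (hypothetical bookkeeping, one entry per pyramid level).
    """
    for n_initial, n_surviving in zip(initial_counts, surviving_counts):
        if n_initial > 0 and (n_initial - n_surviving) / n_initial > max_lost_fraction:
            return True
    return False
```

With counts of, say, 100, 60 and 40 initial features on the three scales, a new key-frame would be triggered as soon as fewer than 10 features survive on the coarsest scale, even if the other scales still track well.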




a) sflowg - 12 frames b) car01 - 8 frames c) osu-1 - 7 frames d) desk - 11 frames



Fig. 2. Examples of feature tracking. The first (brightness reduced) frame of the 
sequence together with trajectories of the tracked features from all scale levels is shown. 

We tested the performance of the tracker using various sequences available on 
the internet and also using sequences taken with a digital video-camera; some
results are shown in Fig. 2. All sequences are color sequences². Figs. 2b,d are taken
with an off-the-shelf camcorder. Fig. 2a,d primarily show translational camera
movement, while Fig. 2b,c show mixtures of both translational and rotational
components. In addition, the shaky path of the hand-held camera is clearly
visible in Fig. 2b,d. In Fig. 2c the head makes a complicated movement while getting
nearer to the camera, which is again accurately captured by the tracking algo- 
rithm. This example also shows the capability of the algorithm to track features 
even under non-rigid transformations (facial movements on static background) 
in the image. 

In the end, the model of the sequence consists of a number of key-frames
with features at several scales. We then save the pixel values in an 11x11 pixel
window around each feature point which could be tracked between key-frames.
Depending on the amount of visual change in the sequence this approach thus
results in a high compression rate since we are using only a small number of
local features but still retain some actual pixel information.

Fig. 3 shows examples of key-frames for three sequences of our database of
car sequences. The database consists of 10 video sequences of cars which were
recorded by walking with a video-camera around a car under daylight lighting
conditions. No effort was made to control shaking or distance to the car. For all
experiments we took only four frames per second from the sequences, resulting
in about 70-80 frames for the whole sequence. Note that the models consist of
similar numbers of key-frames and that the viewpoints of the key-frames are
roughly the same.

² The sflowg and osu-1 sequences are available at
http://sampl.eng.ohio-state.edu/~sampl/data/motion.
Fig. 3. Key-frames for three car sequences.

4 Recognition of Images 

For recognition we use the same framework as for learning. For this we change
the parameters of the tracking algorithm to allow for greater image distances
with $\sigma_{dist}$ = 40, 80, 160 pixels while at the same time also increasing $\sigma_{NCC}$ and
$th_{NCC}$ to 0.7. This ensures that only features with a high support both from
global feature layout and from local feature similarity will be matched. Since we
are using only a small number of features (≈40 over all scales), matching can be
done very fast for all key-frames even in a larger database.

In the following we will describe two types of recognition experiments. The 
first experiment was done with a variety of image degradations to test the stabil- 
ity of the system under controlled conditions - for this, we also used the sflowg, 
osu-1 and desk sequences. For the second experiment we took pictures of five 
cars with a digital camera under a variety of different viewing conditions such 
as different background, different lighting, occlusion, etc. 

4.1 Recognition Results 

Experiment 1 (image degradations): We used 7 types of image degrada- 
tions, which are listed in Table 1. Two random images out of 14 test sequences 
were degraded and then matched against all other images in the sequence; for 
this, matching was done on all three scales and the mean percentage of matches 
recorded. Also shown in Table 1 is the mean percentage of the best matching 
frame and the recognition rate for 28 test images. Recognition failures occurred 
only in the shear, zoom and occlusion conditions, where feature distances were 
sometimes closer to neighboring frames due to the geometric transformations. 
To demonstrate the limits of the feature-based approach, we randomly
superimposed 12x12 pixel squares in the occlusion condition over the images so that 15%
of the image was occluded. This led to a drastic reduction in recognition rate
due to the almost destroyed local image statistics. Fig. 4a shows one correctly 
recognized test image from condition 8. 

Experiment 2 (recognition of novel images): We took 20 more images of 
five of the cars with a digital camera (and thus with different camera optics) 
in different viewing conditions to test how the system behaves in more realistic 
Table 1. Image degradations

type                     percentage matched   recognition rate
1. brightness +60%       66.4%                85.7%
2. contrast +60%         65.6%                85.7%
3. noise +40%            57.6%                92.9%
4. equalized color       53.3%                85.7%
5. shear                 31.9%                78.6%
6. zoom x 1.5            32.9%                82.1%
7. occlude 15%           5.8%                 46.4%
8. all of 1,2,3,4,5      28.3%                75%

Table 2. Recognition of novel images

condition                percentage model   percentage frame (false matches)   recognition rate
1. lighting changed      7.5%               22.3% (9.4%)                       80%
2. viewing distances     9.3%               24.8% (8.2%)                       90%
3. occlusion             10.2%              35.2% (9.2%)                       100%



conditions. These included changes in lighting conditions (Fig. 4b), two different
viewing distances (Fig. 4c) and occlusion by other objects (Fig. 4d). Table 2
lists the percentage of matched features for the best matching model, which was 
obtained by summing up match percentages for each key-frame and dividing 
by the number of key-frames. A closer analysis of the match percentages shows 
that nearly all the matches are concentrated around the best matching key- 
frame, which demonstrates the robustness of the matching process. This is also 
shown by the much higher percentage of matched features for the best match- 
ing key-frame in the second column. The percentage of false matches, which we 
determined by visual inspection, is also indicated in Table 2 in brackets.
Mismatches were mostly due to similar features on the background in the testing
frames³. The nearest key-frame was chosen for 18 out of 20 frames even with
sometimes drastically different backgrounds. However, in all cases the test image 
was assigned to the correct model and in only one case was the best matching 
frame more than two frames away from the right key-frame. 
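
The model score used for Table 2 (sum of the match percentages of all key-frames divided by the number of key-frames) can be sketched as below. The dictionary layout, the function name and the example numbers in the usage note are purely illustrative and are not taken from the experiments.

```python
def score_models(match_percentages):
    """Pick the best model from per-key-frame match percentages.

    match_percentages: dict mapping a model name to the list of match
    percentages obtained for each of its key-frames (assumed layout).
    Returns (best_model, scores) where scores[name] is a tuple
    (mean percentage over key-frames, index of best key-frame).
    """
    scores = {}
    for name, per_frame in match_percentages.items():
        mean = sum(per_frame) / len(per_frame)
        best_frame = max(range(len(per_frame)), key=per_frame.__getitem__)
        scores[name] = (mean, best_frame)
    best_model = max(scores, key=lambda name: scores[name][0])
    return best_model, scores
```

For a hypothetical input such as `{"car_a": [2.0, 30.0, 4.0], "car_b": [3.0, 5.0, 4.0]}` the matches of the winning model are concentrated on a single key-frame, so the averaged model percentage is much lower than the percentage of the best frame, as also seen in the two columns of Table 2.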

Some test images are shown in the upper row of Fig. 4b-f together with their
best matching key-frame in the lower row. 




Fig. 4. Recognition results with a) degraded and b)-f) novel images. Test images are 
depicted in the upper row. 



³ Note that a false match rate of 10% in most cases means not more than 2 false
matches.




C. Wallraven and H. Bülthoff



5 Conclusion and Outlook 

We have presented a recognition system which is capable of on-line learning of 
sparse scene representations from arbitrary image sequences. It uses a simple and 
efficient framework both for learning and recognition. Recognition performance 
on a small database (which we plan to extend also to other object classes) showed 
that even this simple implementation is capable of recognition under a variety of 
different viewing conditions. Approaches such as El for feature characterization 
could even further improve the recognition results. Another by-product of recog- 
nition by key-frames is an approximate pose-estimation which could for example 
be beneficial for active agents in navigation and interaction with objects. 

An extension of this system would be to use the feature trajectories in between 
key-frames in a structure from motion framework to recover a (necessarily) coarse 
3D structure of the scene. This could further help with recognition of objects 
under larger changes in object orientation in the scene. 

Since the system uses no segmentation, the resulting models will include 
information about the object and the surroundings. In an active vision paradigm 
an attention module could actively select the features belonging to one particular 
object, while the others are discarded and not tracked. In addition, we want to 
move the system from closed-loop (i.e. presenting pre-cut sequences for learning) 
to open-loop, where it decides autonomously when to begin a new model. For 
this, we are currently investigating the use of other types of features (such as 
color or texture - see e.g. 0), which can be evaluated on a more global scale. 
Abstract. Here we describe a new method for the quantification of a
global NOx budget from image sequences of the GOME instrument on
the ERS-2. The focus of this paper is on image processing techniques
to separate tropospheric and stratospheric NO2 columns using
normalized convolution with infinite impulse response (IIR) filters to interpolate
gaps in the data and average the cloud coverage of the earth, the
estimation of the NO2 life time and the determination of regional NOx source
strengths.

1 Introduction 

In April 1995 the ERS-2 satellite was launched by the European Space Agency 
(ESA). The satellite carries, besides other instruments, the Global Ozone
Monitoring Experiment (GOME), an instrument that allows trace gas
concentrations in the atmosphere to be measured using Differential Optical Absorption
Spectroscopy (DOAS) [5].




Fig. 1. Scanning geometry of the GOME instrument and visualization of the measured
slant column densities (SCD).

The GOME instrument consists of four spectrometers which monitor the 
earth in nadir view with a ground pixel size of 320x40 km. It records a spectrum 
every 1.5 seconds and needs about 3 days to completely scan the whole earth. 
Measured trace gas concentrations are integrated along a light path through the 
whole atmosphere (see Fig. 1). Since these paths are usually not vertical they are
called slant column densities (SCD). To obtain column density data which are
independent of the viewing geometry the SCDs are transformed into vertical
column densities (VCD) using the so-called Air Mass Factor (AMF), which
depends in particular on the solar zenith angle (SZA), ground albedo and the
cloud cover. Evaluation of the DOAS spectra is done using a nonlinear fit
algorithm to match reference spectra of trace gases against the measured spectrum
and hence determine the specific concentration. [5] made a major improvement
in evaluation speed by using B-spline interpolation |S| and reducing the
number of iteration steps, so that long-term analyses can be done in acceptable time.

To estimate NOx source strengths, a separation of the tropospheric
and stratospheric NO2 concentrations has to be done first, as shown in Section 2.
Furthermore, in Section 3 a new technique to determine the NO2 life time is
presented. Combining these results, an estimation of regional NOx source
strengths can be done in Section 4.



2 Separation of Stratosphere and Troposphere 

First step in the subsequent analysis is the separation of the stratospheric and 
the tropospheric contribution to the total NO2 column as we are only interested 
in the tropospheric NO2 emissions. 

Fig. 2 (a) shows the typical distribution of the vertical column density (VCD) 
of NO2 for 9 September 1998. On this map the basic assumptions to discriminate 
troposphere and stratosphere can be observed: 

The stratospheric contribution to the total column varies on a much larger
scale than the tropospheric fraction, due to the longer life time of nitrogen oxides in
the stratosphere and the fact that the tropospheric emissions are mainly caused
by localized emissions of industrial sources or biomass burning events.

It can be observed that the stratospheric distribution is less variable in lon- 
gitudinal direction than in latitudinal direction where apparently a typical lat- 
itudinal profile is established. This is mainly due to the wind system in the 
stratosphere. 

As described in |S|, clouds hide parts of the visible NO2 column which lie
below the cloud layer (in the troposphere). As a result the NO2 column observed
over cloudy regions will mainly consist of stratospheric contributions. If such 
a condition happens to appear over the ocean the effect becomes even more 
important since due to the low earth albedo the tropospheric contribution is 
attenuated relative to the stratospheric one. Therefore cloudy pixels over sea 
should represent the stratospheric NO2 column. 



2.1 Cloud Detection 

These considerations will now be exploited to estimate the stratospheric NO2 
distribution. First land regions are masked out to avoid errors in the strato- 
spheric signal due to tropospheric contributions. On the remaining pixels the 
stratospheric proportion of the total column will dominate. For further
limitation cloud-free pixels are masked out as well. This is done using the cloud
detection algorithm introduced by 0 which is able to deduce a cloud fraction 
for each GOME pixel. Using a threshold we segment pixels with a cloud frac- 
tion of at least 50% and apply the resulting mask on the trace gas map. This 
procedure yields an image of NO2 column densities with pixels that very likely 
represent the stratospheric NO2 column but contain large gaps due to the mask- 
ing process. 
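
The masking step can be illustrated with a small sketch. It assumes that the column densities, the cloud fractions and a land flag are available as flat per-pixel lists; these names and the list layout are hypothetical, while the 50% threshold is the value given above.

```python
def stratospheric_candidates(vcd, cloud_fraction, is_land, min_cloud_fraction=0.5):
    """Keep only cloudy ocean pixels of a trace gas map.

    Returns (masked_values, weights): weights are 1.0 for retained
    pixels and 0.0 for gaps; masked values are set to 0.0 in the gaps.
    """
    masked_values = []
    weights = []
    for value, fraction, land in zip(vcd, cloud_fraction, is_land):
        keep = (not land) and fraction >= min_cloud_fraction
        weights.append(1.0 if keep else 0.0)
        masked_values.append(value if keep else 0.0)
    return masked_values, weights
```

The returned weight list marks the gaps explicitly, so that a subsequent interpolation can distinguish missing data from genuinely small column densities.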



2.2 Interpolation by Normalized Convolution 

These gaps have to be interpolated in order to estimate the stratosphere. For
this purpose we use the concept of Normalized Convolution
by m. If g denotes the original image, where g(x,y) is the NO2 concentration, 
m the mask for each pixel in the interval [0, 1], which is given by $1/\sigma_g$ and is
0 for the gaps, then the interpolated image g' is given by

$$g' = \frac{B(m \cdot g)}{B\,m}$$

with the lowpass filter operator B, which can be of different size in any coordinate
direction. The advantage of this filter type lies in its efficiency and the fact
that it combines interpolation and averaging. The resulting numerical errors
depend on the size of the gaps and are typically between 3.0% and 20.0%;
the resulting stratospheric NO2 VCDs range from $1 \cdot 10^{15}$ to $4 \cdot 10^{15}$ molec/cm² (see
Fig. 2).

For B we chose the IIR (infinite impulse response) Deriche filter (see P):
$B(x) = k(\alpha|x| + 1)e^{-\alpha|x|}$, with $x \in \mathbb{Z}$ the coordinate of one dimension,
normalization k and $\alpha \in \mathbb{R}^+$ as a steering parameter. This filter is similar to a
Gaussian filter but has the advantage of recursive implementation. This is
extremely important because the use of FIR (finite impulse response) filters on a
3D image (latitude, longitude and time) is very time consuming. The Deriche
filter is a second order recursive filter with causal and anticausal parts
($g'(x) = (B g)(x) = g_1(x) + g_2(x)$):

$$g_1(x) = \underbrace{k\,\bigl(g(x) + e^{-\alpha}(\alpha - 1)\,g(x-1)\bigr)}_{\text{FIR}} + \underbrace{2e^{-\alpha}g_1(x-1) - e^{-2\alpha}g_1(x-2)}_{\text{IIR}} \quad (1)$$

$$g_2(x) = \underbrace{k\,\bigl(e^{-\alpha}(\alpha + 1)\,g(x+1) - e^{-2\alpha}g(x+2)\bigr)}_{\text{FIR}} + \underbrace{2e^{-\alpha}g_2(x+1) - e^{-2\alpha}g_2(x+2)}_{\text{IIR}} \quad (2)$$

This means 8 multiplications and 7 additions per pixel and dimension for a 
given steering parameter a. The different variability of the NO2 maps in each 
direction can be exploited by the use of different values for a in latitudinal and 
longitudinal direction. 
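
A one-dimensional sketch of the interpolation scheme is given below. It implements the causal and anticausal recursions in the form that follows from the kernel $B(x) = k(\alpha|x|+1)e^{-\alpha|x|}$ and then forms the quotient of the filtered weighted signal and the filtered mask. The normalization $k = (1-e^{-\alpha})^2 / (1 + 2\alpha e^{-\alpha} - e^{-2\alpha})$, which makes the kernel sum to one, the zero boundary treatment and all function names are assumptions of this sketch and are not taken from the implementation described in the text.

```python
import math


def deriche_norm(alpha):
    """Normalization k so that the kernel sums to one over all integers."""
    q = math.exp(-alpha)
    return (1.0 - q) ** 2 / (1.0 + 2.0 * alpha * q - q * q)


def deriche_kernel(x, alpha):
    """Kernel value B(x) = k (alpha |x| + 1) exp(-alpha |x|)."""
    return deriche_norm(alpha) * (alpha * abs(x) + 1.0) * math.exp(-alpha * abs(x))


def deriche_smooth(g, alpha):
    """Recursive smoothing of a 1D list g (values outside are taken as 0)."""
    q = math.exp(-alpha)
    k = deriche_norm(alpha)
    n = len(g)
    g1 = [0.0] * n  # causal part, runs left to right
    g2 = [0.0] * n  # anticausal part, runs right to left
    for x in range(n):
        prev_in = g[x - 1] if x >= 1 else 0.0
        y1 = g1[x - 1] if x >= 1 else 0.0
        y2 = g1[x - 2] if x >= 2 else 0.0
        g1[x] = k * (g[x] + q * (alpha - 1.0) * prev_in) + 2.0 * q * y1 - q * q * y2
    for x in range(n - 1, -1, -1):
        in1 = g[x + 1] if x + 1 < n else 0.0
        in2 = g[x + 2] if x + 2 < n else 0.0
        y1 = g2[x + 1] if x + 1 < n else 0.0
        y2 = g2[x + 2] if x + 2 < n else 0.0
        g2[x] = k * (q * (alpha + 1.0) * in1 - q * q * in2) + 2.0 * q * y1 - q * q * y2
    return [a + b for a, b in zip(g1, g2)]


def normalized_convolution(g, m, alpha):
    """Interpolate gaps: g' = B(m * g) / B(m), with m = 0 in the gaps."""
    numerator = deriche_smooth([mi * gi for mi, gi in zip(m, g)], alpha)
    denominator = deriche_smooth(m, alpha)
    return [num / den if den > 1e-12 else 0.0
            for num, den in zip(numerator, denominator)]
```

Since numerator and denominator are filtered with the same operator, a constant signal with gaps is restored exactly, and the truncation at the borders cancels in the quotient. For a 2D or 3D map the same routine would be applied along each axis in turn, possibly with a different value of the steering parameter per axis.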

The result of the interpolation is now defined as representing the stratospheric
background of the total NO2 column (see Fig. 2 (b)). The tropospheric contribution
can now be estimated by forming the difference between the original image and 
the estimated stratosphere. The resulting image has to be multiplied by a correc- 
tion image which is the quotient of the stratospheric and tropospheric air mass 
[Figure panels: (a) Vertical Column Density, 15.9.1998; (b) Estimate of the Stratosphere, 15.9.1998; (c) Calculation of the Difference. Color scale: Vertical Column NO2 in molec/cm2.]

Fig. 2. Visualization of the discrimination algorithm. Starting with map (a) we mask 
out land masses and cloud free pixels and estimate the stratosphere (b) using Normal- 
ized Convolution. The tropospheric residual (c) can then be estimated calculating the 
difference between (a) and (b). It shows pronounced maxima in industrialized regions 
(please note the different scales of the maps). 

factors and these factors are dependent on the actual solar zenith angle and
the ground albedo. In the resulting image Fig. 2 (c) we see that localized
emission sources appear pronounced, whereas the global stratospheric trend is
suppressed nearly completely.

Additional corrections have to be made in order to convert the NO2
concentrations into NOx concentrations and to consider the influence of clouds. The first
is implemented using the Leighton ratio, which depends on the ozone
concentration [O3], the NO2 photolysis frequency J(NO2) and the rate coefficient
of the reaction of O3-I-NO2, k = 1.1 ■ 10“^^ -exp( ~ The factor which 

corrects for the influence of clouds is determined statistically by analyzing the 
annual average of NO2 concentrations for different cloud covers. 

3 Estimation of the Mean NO 2 Life Time 

For the determination of the mean NO2 source strength from the tropospheric
maps the knowledge of the NO2 life time τ is necessary. Fig. 3 (e) shows increased
NO2 columns over the ocean near coasts in the downwind direction, whereas in the opposite
direction no such feature can be found. This behavior is due to the chemical decay
of NO2 over the ocean where there is no net production but only chemical decay




Image Sequence Analysis of Satellite NO2 Concentration Maps



[Figure panels: Sun Zenith Angle; Albedo [%]; Correction Factor [%]; Tropospheric Residual; Tropospheric Residual with stratospheric Air Mass Factor. Color scale of the residual maps: VCD NO2 in molec/cm2.]



Fig. 3. Illustration of the correction process for the tropospheric residual with the
tropospheric Air Mass Factor. The correction factor (c) is calculated from the SZA
(a) and the ground albedo (b) which we estimate directly from the GOME PMD
(Polarization Monitoring Device, part of the GOME Spectrometer) data. Applying
(c) to the original tropospheric residual (d) we obtain the corrected tropospheric NO2
column in (e).




Fig. 4. Example of the decay curve along a latitudinal section through a NO2 plume
at the eastern shore of the US. The 1/e width of the curve can be calculated by a
nonlinear regression with an exponential function. Combined with the average wind
speed (6.8 ± 0.5 m/s) an average NO2 lifetime of 27 ± 3 hours is derived.
(see Fig. 4). From this decay curve we estimate the mean NO2 life time from a
case study in North America.

The basic assumption for the method is that in the annual mean the wind
speed u and the life time τ can be substituted by their mean values and that the
chemical decay can be described by a linear model (first order decay). The decay
curve of NO2 then shows a static behavior and can be expressed by the following
equation (in x-direction): $\frac{dc}{dt} = 0 = u\,\frac{\partial c}{\partial x} + \frac{c}{\tau}$, with the NO2 concentration c.
From integration it follows directly: $c(x) = c_0 \exp\left(-\frac{x}{u\tau}\right)$. The mean NO2
life time τ over one year can thus be calculated by the determination of the 1/e
width of the decay curve.

From a nonlinear fit the 1/e width is found to be $x_e$ = 670 ± 75 km. This fit has
been done for every dataline of the marked area (see Fig. 4) and the variation of
the results has been used to determine the error for $x_e$. The average wind speed
was u = 6.8 ± 0.5 m/s for the lower troposphere (from ftp://zardoz.nilu.no).
From this information we calculate a resulting life time of τ ≈ 98000 s =
27 ± 3 h, which corresponds well to values found in literature of approximately
one day.
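
The arithmetic behind this value can be checked with a short sketch. It assumes the steady-state decay model $c(x) = c_0 \exp(-x/(u\tau))$, so that the 1/e width equals $u\tau$; the function names are illustrative and the nonlinear fit itself is not reproduced here.

```python
import math


def lifetime_seconds(x_e_km, wind_speed_m_per_s):
    """Life time tau = x_e / u for a 1/e width x_e and mean wind speed u."""
    return x_e_km * 1000.0 / wind_speed_m_per_s


def decay(c0, x_m, wind_speed_m_per_s, tau_s):
    """Steady-state concentration at distance x downwind of the source."""
    return c0 * math.exp(-x_m / (wind_speed_m_per_s * tau_s))


tau_s = lifetime_seconds(670.0, 6.8)   # 1/e width and wind speed from the text
tau_h = tau_s / 3600.0                  # about 27.4 hours
```

With $x_e$ = 670 km and u = 6.8 m/s this gives roughly 98500 s, in line with the rounded value of 98000 s or 27 h quoted above.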

4 Estimation of the Mean NOx Source Strength

The estimation of the source strength (production rate) λ can now be done from
the data of the annual mean image of the tropospheric NO2 residual considering
the correction factor mentioned above. Assuming that the production rate as well
as the life time τ is constant over the year, the temporal development of
the NO2 concentration is described by

$$\frac{dc}{dt} = \lambda - \frac{c}{\tau} \;\Rightarrow\; \bar{c}_T = \frac{1}{T}\int_0^T c(t)\,dt = \frac{\lambda}{T}\left(\tau T + \tau^2\left(e^{-T/\tau} - 1\right)\right) \approx \lambda\tau \quad (3)$$

with the mean concentration $\bar{c}_T$ for the time period T. By comparison with the
measured values $\bar{c}_T$ (see Fig. 3 (e)) and the life time τ from Section 3 we can now
estimate the global nitrogen budget and the mean production rate λ over one
year.
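
How good the approximation $\bar{c}_T \approx \lambda\tau$ is for a life time of about one day and an averaging period of one year can be checked numerically. The sketch below assumes the first order model with c(0) = 0; the function names and the unit choice (hours) are illustrative.

```python
import math


def mean_concentration(lam, tau, period):
    """Mean of c(t) = lam * tau * (1 - exp(-t / tau)) over [0, period]."""
    return lam / period * (tau * period + tau * tau * (math.exp(-period / tau) - 1.0))


def production_rate(c_mean, tau, period):
    """Invert the relation above to obtain the production rate lam."""
    return c_mean * period / (tau * period + tau * tau * (math.exp(-period / tau) - 1.0))


tau_hours = 27.0          # life time from the case study
year_hours = 8760.0       # averaging period of one year
ratio = mean_concentration(1.0, tau_hours, year_hours) / tau_hours
```

For these values the ratio of the exact mean to $\lambda\tau$ is about 0.997, so the simple estimate $\lambda \approx \bar{c}_T/\tau$ differs from the exact inversion by only about 0.3%, far below the other uncertainties discussed in the text.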

Results of the determination of the global source strength and nitrogen
budget are presented in Fig. 5. It shows that the global emission strength derived
with this algorithm is (48 ± 20) Tg N yr^-1, which is in good agreement with
reference values from literature (all values are in units of Tg N yr^-1): Logan
et al. (JGR 88:10,785-807) 1983: 25-99, IPCC Report Cambridge 1990: 16-55,
Hough (JGR 96:7325-7362) 1991: 42.0, Fenner et al. (JGR 96:959-990) 1991:
42.2, Müller (JGR 97:3787-3804) 1992: 21.6, IPCC Report Cambridge 1992:
35.1-78.6, IPCC Report Cambridge 1995: 52.5, Olivier et al. Rep. no. 771060002
Bilthoven 1995: 31.2 and Lee et al. (Atm.Env.31(12):1735-1749) 1997: 23-81.

Whereas the absolute values still contain uncertainties of approximately a
factor of 2, at least their relations show the order of the regions emitting most
nitrogen. It is shown that Africa emits most nitrogen though it is only a poorly
industrialized region. This is most likely due to biomass burning and soil
emissions. A more detailed description of the algorithms used and interpretation of
the results can be found in jO]. 
Fig. 5. Estimate of the mean NOx burden (in 10^9 g nitrogen) and emission (in 10^12 g
nitrogen per year and in kg nitrogen/km² and year) for 1997 for different parts of the
world.



5 Conclusions 

Our new method to discriminate the tropospheric and stratospheric NO2
contributions to the vertical NO2 column from GOME satellite data uses only intrinsic
image information and can thus be applied self-consistently to each global NO2
map of an image sequence (see Section 2). From the tropospheric NO2 images
information about the mean NO2 life time can be obtained by analyzing its
decay behavior in coastal regions (see Section 3). The estimated life time of 27 ± 3 h
corresponds very well to values found in literature of about one day. This leads to
a first estimation of the mean NOx budget for the year 1997; the calculations
will also be done for the following years. The errors can be decreased by better
knowledge about clouds and aerosols, which allows a more precise calculation of
the air mass factors. Though the errors are relatively high, this method is a good
alternative to the traditional ground based measurements.

We realize that in each of the evaluation steps described above substantial 
improvements are possible and necessary. While work is under way in our insti- 
tution to make these improvements, this manuscript is focussed on presenting 
preliminary results which, nevertheless, clearly demonstrate the power of this 
new technology, and to make them available to the scientific community.
Examples where uncertainties and systematic errors can be significantly reduced
include better characterization of the effect of clouds and aerosol on the
tropospheric NO2 column seen by GOME. Also the estimate of the NOx lifetime can
clearly be improved beyond our simple assumption of a global average value.
Abstract. We show a framework for growth analysis of plant roots in
object coordinates which is one requirement for the botanical evaluation
of growth mechanisms in roots. The method presented here is applicable
to long image sequences of up to several days and places no limit on the
sequence length. First we estimate the displacement vector field with the
structure tensor method. Thereafter we determine the physiological coor- 
dinates of the root by active contours. The contours are first fitted on the 
root boundary and yield the data for the calculation of the middle line as 
the object coordinate axis of the root. In the third step the displacement 
field is sampled at the position of the middle line and projected onto it. 
The result is an array of tangential displacement vectors along the root 
which is used to compute the spatially resolved expansion rate of the 
root in physiological coordinates. Finally, the potential of the presented 
framework is demonstrated on synthetic and real data. 



1 Introduction 

Plant expansion growth can be mapped to a high spatial and temporal resolution 
by using optical flow techniques [Schmundt and Schurr, 1999], [Schurr et al.,
2000]. However, not only the magnitude, but also the location of growth zones 
on the biological object are required. Therefore it is necessary to project maps 
detected by low-level analysis to feature-based coordinates of the plant. 

In the growing root the relevant physiological coordinate axis is the middle 
line originating at the root tip: cells that have been formed at the root tip expand 
along this axis. 

The common method for measuring root growth is to put ink markers or
other tracers on the root [Beemster and Baskin, 1998], [Walter et al., 2000].
The growth rate (GR) is then determined by observing the change of their
position over time. This technique has only a coarse temporal and spatial
resolution, the number of markers is limited, and the sequences are very
inconvenient to evaluate.

Root growth occurs only in the region near the root tip [Walter et al., 2000].
We therefore observe only this area. Optimal resolution is obtained in a setup
in which the root fills most of the image. In that case, however, the expansion
process moves the root out of the imaged frame. For long-term observations
(over several days) we follow the tip by moving the camera on an x-y stage. To
study diurnal rhythms during light and dark, illumination in the near infrared
is used.

The described algorithm has three components. First, the displacement
vector field on the root (vx, vy) is calculated in image coordinates by the
structure tensor method. Then the object coordinate axis is extracted by
an active contour model (sec. 3). The last step projects the displacement
vector field onto the coordinate axis (sec. 4). This results in the tangential
displacements vt along the root axis. We obtain the growth distribution by
computing the divergence of vt. In sec. 5 we apply the method to real and
synthetic data.



2 Displacement Vector Field 



We calculate the displacement vector field (DVF) with the so-called structure
tensor method. The assumption that any brightness change is due to movement
leads directly to the optical flow constraint equation [Horn and Schunck, 1981]:

$$\frac{dg}{dt} = \frac{\partial g}{\partial x}\frac{dx}{dt} + \frac{\partial g}{\partial y}\frac{dy}{dt} + \frac{\partial g}{\partial t} = (\nabla g)^T v = 0 \qquad (1)$$

where we use $\nabla^T = (\partial/\partial x, \partial/\partial y, \partial/\partial t)$,
$v^T = (dx/dt, dy/dt, 1) = (v_x, v_y, 1)$ and $g$ as the grey value. For further
details we refer to [Haußecker and Jähne, 1997, Haußecker and Spies, 1999].
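
The constraint in equation (1) can be illustrated with a minimal sketch. It assumes a small grey-value volume with axes (t, y, x) and a single displacement for the whole volume; the function name `structure_tensor_flow` is hypothetical, and the cited papers describe a more complete method (local weighting, eigenvalue analysis, confidence measures) that is not reproduced here.

```python
import numpy as np

def structure_tensor_flow(volume):
    """Estimate one displacement (vx, vy) for a small (t, y, x) grey-value volume."""
    # Partial derivatives along t, y and x (central differences in the interior).
    g_t, g_y, g_x = np.gradient(volume.astype(float))
    # Drop the border, where np.gradient falls back to one-sided differences.
    inner = (slice(1, -1),) * 3
    grads = np.stack([g_x[inner].ravel(), g_y[inner].ravel(), g_t[inner].ravel()])
    # Structure tensor: sum of outer products of the spatio-temporal gradient.
    J = grads @ grads.T
    # The eigenvector of the smallest eigenvalue is (nearly) orthogonal to all gradients.
    eigvals, eigvecs = np.linalg.eigh(J)
    e = eigvecs[:, 0]
    # Normalise so that the temporal component is 1, i.e. v = (vx, vy, 1).
    return e[0] / e[2], e[1] / e[2]
```

Solving for the eigenvector of the smallest eigenvalue is a total least squares fit of equation (1) over all pixels of the volume; regions without texture make this fit ill-conditioned.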

The structure tensor method provides a confidence measure C which indicates
the positions where a displacement estimate is possible (the DVF can only be
estimated in regions with sufficient texture). In the later evaluation this is
taken into account by computing a fillfactor (sec. 5.2). This is the ratio of
the number of reliable estimates, i.e. those whose confidence measure C passes
a threshold, to the number of pixels which belong to the root.
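
As a minimal sketch of this ratio, assuming a confidence image, a boolean root mask of the same shape, and a threshold that the text does not specify (the name `fill_factor` and the parameter `threshold` are hypothetical):

```python
import numpy as np

def fill_factor(confidence, root_mask, threshold):
    """Fraction of root pixels whose confidence exceeds the threshold."""
    root_mask = np.asarray(root_mask, dtype=bool)
    # Reliable estimates: confident pixels that also lie on the root.
    reliable = (np.asarray(confidence) > threshold) & root_mask
    n_root = root_mask.sum()
    return reliable.sum() / n_root if n_root else 0.0
```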

This step yields the DVF of the root which is later sampled at the middle 
line positions and then used to calculate the growth distribution. 



3 Physiological Root Coordinates 

The aim of this work is to make biologically meaningful measurements of root
growth. One requirement for this is to calculate the growth rates in a
physiological coordinate system. As cells emerge at the very root tip and
expand along the root axis, the distance from the root tip, measured along the
middle line of the root, defines the coordinate axis. This line is determined
as a b-spline, which gives a continuous mathematical description of the curve.
The course of the b-spline is defined by a number of fulcrums, and one can move
along the line by varying a parameter value continuously. Integer parameter
values correspond to the fulcrum positions of the b-spline.

Fig. 1. (a) Original root image. (b) Preprocessed image with fitted root boundary,
normals for position estimation of the boundary and extracted middle line.



In a first step, the root boundaries are extracted and represented by an active
contour (a b-spline with the ability to fit its position to the image content
[Blake and Isard, 1998]). In the second step they are used to determine the
middle line as another b-spline (sec. 3.2). Subsequently we sample the DVF at
the positions of the middle line at distances of one pixel. As we know the
tangent direction of a b-spline at any position, we can easily project the DVF
onto the middle line (sec. 4).

3.1 Root Boundary 

To extract the boundary of the observed root with an active contour, we first
preprocess the sequence to prevent the contour from being pulled off the root
boundary by, e.g., air bubbles in the image (see Fig. 1). A global threshold is
applied and the largest segmented area is used as the root mask. Applying this
mask to the original image removes everything but the root.

A pixel line of the boundary is obtained by eroding the root mask by one pixel
and subtracting the erosion result from the mask. Along this pixel line we take
45 evenly spaced points as initial fulcrums for the active contour. The active
contour adjusts itself to the root image by replacing each fulcrum position by
the position of the maximal gradient in the direction normal to the spline (see
Fig. 1b). Where the spline is too far away from the boundary because fulcrums
are missing, new ones are introduced.
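
The first sentence of the paragraph above (erosion followed by subtraction) can be sketched as follows. The 4-neighbourhood structuring element and the function name `boundary_pixels` are assumptions; the text does not state which neighbourhood is used, and the selection of the 45 fulcrums is not shown.

```python
import numpy as np

def boundary_pixels(mask):
    """Return the one-pixel-wide boundary of a binary mask (4-neighbourhood erosion)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel survives erosion only if it and its four neighbours are set.
    eroded = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
              & padded[1:-1, :-2] & padded[1:-1, 2:])
    # Subtracting the eroded mask from the mask leaves the boundary line.
    return mask & ~eroded
```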

The above procedure is applied to each image of the sequence and gives us 
the root boundary in the form of a b-spline Sb which is used in the next section 
to extract the middle line. 

3.2 Middle Line 

The b-spline Sb of the root boundary is now used to obtain the middle line of
the root in the form of a b-spline Sm. We search for the fulcrums of the middle
line Sm by applying the following method to each fulcrum of the boundary Sb.
First, starting from the fulcrum position, we search in the direction normal
to Sb for the root boundary on the other side of the root (see Fig. 1b for the
normals). The other side of the root is detected by mirroring the gradient filter
which is applied along the normals of Sb to find the exact root boundary in
sec. 3.1. The convolution then returns a maximum at an edge in the other
direction, that is, on the other side of the root. The new fulcrum of the middle
line Sm is positioned midway between this new point and its corresponding
fulcrum of Sb.

In the region of the root tip this procedure gives no result because the
normals along which the other side of the root is searched do not intersect the
root boundary (see Fig. 1b). Here we extrapolate Sm linearly and intersect the
boundary Sb with this extrapolation to obtain the fulcrum of the middle line.

Now the axis of the physiological root coordinates is available as a b-spline
Sm, that is, as a continuous mathematical description that provides the
direction at every position, which is needed for the projection of the DVF
onto Sm.

4 Projection of the DVF 

At this point we combine the DVF as calculated with the structure tensor
(section 2) and the middle line Sm extracted with an active contour (section 3)
in each image of the sequence.

In order to separate movements of the whole root from motion caused by
growth, we project the DVF onto Sm. The projection of the DVF is calculated
at intervals of 1 pixel starting from the root tip. In this way we get an array
of the tangential displacements along Sm. At each position the following steps
are carried out.

First we determine the indices of Sm at the sampling positions, where the DVF
is interpolated bilinearly. For stabilization the DVF is averaged in the
direction normal to Sm over 21 positions. This value is finally projected onto
Sm to get the tangential displacement vt(x) of the root.

This yields a one-dimensional array for each image which contains the
tangential displacements vt(x) along the middle line of the root at intervals
of 1 pixel. The relative GR distribution of the root g(x) is computed as the
slope of this array for each time step.
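
The projection and the slope computation can be sketched as below. The sketch assumes that the DVF has already been sampled (and averaged) at the middle line positions and that the tangent vectors of the middle line are given; the function names are hypothetical, and the bilinear interpolation and the averaging over 21 positions are left out.

```python
import numpy as np

def tangential_displacement(dvf_samples, tangents):
    """Project displacement vectors (N, 2) onto the tangents (N, 2) of the middle line."""
    # Normalise the tangents so that the dot product is a true projection.
    tangents = tangents / np.linalg.norm(tangents, axis=1, keepdims=True)
    return np.sum(dvf_samples * tangents, axis=1)

def relative_growth_rate(v_t, spacing=1.0):
    """Slope of the tangential displacement along the middle line."""
    return np.gradient(v_t, spacing)
```

For a straight root along the x-axis whose tangential displacement grows linearly with the distance from the tip, the slope is constant, and a sideways motion of the whole root does not contribute to it.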

5 Experiments

The presented framework was tested with artificial and real data to show its 
accuracy and suitability for real botanical measurements. 

5.1 Synthetic Data 

The synthetic data to which we apply our method were created to estimate its
accuracy at different speeds of the root tip and under the influence of noise.
The artificial root (see Fig. 2) consists of Gaussian spots which are displaced
in x-direction by 0.5, 1.0 and 1.5 pixel/frame at the root tip, with a
sinusoidal decrease along the root (see Fig. 3a, c, e). Normally distributed
noise with a standard deviation of 1, 2 and 3 grey values was added to the
8 bit images, which approximately corresponds to the noise level of the cameras
used. The estimated displacements are shown in Fig. 3a, c and e; they are
systematically lower than the real displacements. The maximum estimation error
of the displacement occurs where the displacement decreases steeply, because
there the assumption of a homogeneous displacement is violated. Nevertheless
the resulting GR distribution is close to the real one. This resolution fulfils
the requirements of the botanical application.




Fig. 2. Synthetic image. (a) Original. (b) Preprocessed with extracted boundary,
position estimation normals and middle line.

Fig. 3. Results for the synthetic data at different root tip displacements and
noise levels. Gaussian noise with a standard deviation σ of 1, 2 and 3 grey
values was added to each sequence. The estimated DVF and divergence are shown
only for noise with σ = 1; for the other noise levels only the estimation error
is shown. (a) DVF at a tip displacement vt of 0.5 pixel/frame and errors for
different noise levels. (b) Divergence at vt = 0.5 pixel/frame. (c) DVF at a tip
displacement vt of 1.0 pixel/frame. (d) Divergence at vt = 1.0 pixel/frame.
(e) DVF at a tip displacement vt of 1.5 pixel/frame. (f) Divergence at
vt = 1.5 pixel/frame.

5.2 Real Data

The real data were recorded at the Botanical Institute of the University of
Heidelberg. Images of maize roots were taken at intervals of 5 minutes with a
resolution of 640x480 pixels, and the middle line was extracted (see Fig. 4a, b).
As the results scatter rather strongly (see Fig. 4c), we compute an average DVF
(see Fig. 4d). For this purpose each DVF with a fillfactor (sec. 2) of less
than 0.2 is rejected.
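
A minimal sketch of this averaging step is given below. It assumes that each frame provides a one-dimensional tangential displacement profile together with its fillfactor; because the profiles become longer as the root grows, shorter profiles are padded with NaN before averaging, which is an assumption of this sketch and not taken from the text. The function name `average_dvf` is hypothetical.

```python
import numpy as np

def average_dvf(profiles, fill_factors, min_fill=0.2):
    """Average tangential displacement profiles, skipping unreliable frames."""
    kept = [np.asarray(p, dtype=float)
            for p, f in zip(profiles, fill_factors) if f >= min_fill]
    if not kept:
        raise ValueError("no profile reaches the required fill factor")
    # Profiles grow with the root, so pad shorter ones with NaN before averaging.
    length = max(len(p) for p in kept)
    padded = np.full((len(kept), length), np.nan)
    for row, p in zip(padded, kept):
        row[:len(p)] = p
    # Average position by position over the frames that cover it.
    return np.nanmean(padded, axis=0)
```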




Fig. 4. (a) Recorded image of a maize root. (b) Preprocessed image with extracted
boundary and middle line. (c) Diagram of time t (y-axis) over the projected DVF
vt(x) (x-axis). The root tip is located on the left. High and low displacements
are shown bright and dark, respectively. The length of the DVF increases as the
root grows. The step in the middle is caused by the camera tracking of the root
tip, where no displacement estimation is possible. (d) Averaged DVF vt(x) and
spatially resolved growth rate distribution of the root. The root tip is
located at the origin.



6 Conclusion and Acknowledgements 

We have presented a framework for root growth measurements in physiological
coordinates. We combine the estimation of the DVF with a technique that samples
the estimates along the middle line Sm of the root and projects the DVF onto Sm.
This enables biologists to measure directly in the relevant coordinate system.
In future work this method will be extended to more complicated coordinate
systems such as the vein system of plant leaves. Part of this work has been
funded under the DFG research unit "Image Sequence Analysis to Investigate
Dynamic Processes" (For240).

Abstract. In the last few years research in 3-D object recognition has focused
more and more on active approaches. In contrast to the passive approaches of the
past decades, where a decision is based on one image, active techniques use more
than one image from different viewpoints for the classification and localization of
an object. In this context several tasks have to be solved, in particular how to
choose the different viewpoints and how to fuse the multiple views.

In this paper we present an approach for the fusion of multiple views within 
a continuous pose space. We formally define the fusion as a recursive density 
propagation problem and we show how to use the Condensation algorithm for 
solving it. 

The experimental results show that this approach is well suited for the fusion of 
multiple views in active object recognition. 

Keywords. Active Vision, Sensor Data Fusion



1 Introduction 



Active object recognition has been investigated in detail recently 14181117121 . The main 
motivation is that recognition can be improved if the right viewpoint is chosen. First,
ambiguities between objects that make recognition difficult or even impossible can be
avoided. Second, one can avoid presenting views to the classifier for which worse
results are expected on average. Those views depend on the classifier and can be
identified right after training, when the first tests are performed.

One important aspect of active object recognition, besides the choice of the best
viewpoint, is the fusion of the classification and localization results of a sequence of
viewpoints. The problem of how to fuse the collected views into a final classification
and localization result does not arise only for ambiguous objects, for which more than
one view might be necessary to resolve the ambiguity (examples are presented in the
experimental section). A sequence of views will also improve the recognition rate in
general if a decent fusion scheme is applied. In this paper we present a fusion scheme
based on the Condensation algorithm [3]. The reason for applying the Condensation
algorithm is



* This work was partially funded by the German Science Foundation (DFG) under grant SFB
603/TP B2. Only the authors are responsible for the content.

[Figure 1 panels: object o1 (gun), object o2 (trumpet), object o3 (lamp)]

Fig. 1. Examples of the seven toy manikins used for the experiments. Please note that objects o4
to o7 cannot be classified with one view due to the complex ambiguities



threefold. First, one inherently has to deal with multimodal distributions over the class
and pose space of the objects. Second, moving the camera from one viewpoint to the
next adds uncertainty to the fusion process, since the movement of the camera is
always disturbed by noise. Thus, when the classification and localization results
acquired so far are fused with the results computed from the current image, this
uncertainty must be taken into account. Third, it is not straightforward to model
the involved probability distributions in closed form, especially if multiple hypotheses,
i.e. multimodal distributions, are to be handled. These three aspects lead us to believe
that the Condensation algorithm is perfectly suited for the fusion of views in active
object recognition. In particular, its ability to handle dynamic systems is advantageous:
in viewpoint fusion the dynamics are given by the known but noisy camera motion between
two viewpoints.

In the next section we summarize the problem and propose our sensor data fusion
scheme based on the Condensation algorithm. The experiments and an introduction to
the classifier used in them are presented in Section 3 to show the practicability
of our method. Finally, a conclusion is given in Section 4.



2 Fusion of Multiple Views

In active object recognition, the classification and localization of a static object are
based on a sequence or series of images. These images are used to improve the robustness
and reliability of the object classification and localization. In this active approach,
object recognition is not simply a task of repeated classification and localization for
each image, but rather a well-directed combination of a well-founded fusion of images and
an active viewpoint selection.

This section deals with the principles of the fusion of multiple views. Approaches 
for active viewpoint selection will be left out in this paper. They have been presented in 

2.1 Density Propagation with the Condensation Algorithm

Given an object, a series of observed images $f_n, \ldots, f_0$ and the camera
movements $a_{n-1}, \ldots, a_0$ that lead to these images, one wants to draw conclusions
from these observations about the non-observable state $q_n$ of the object. This state
$q_n$ contains the discrete class and the continuous pose of the object.

In the context of a Bayesian approach, the knowledge about the object's state is given
in the form of the a posteriori density $p(q_n | f_n, a_{n-1}, f_{n-1}, \ldots, a_0, f_0)$.
This density can be calculated from

$$p(q_n | f_n, \ldots, a_0, f_0) = \frac{1}{k_n}\, p(q_n | a_{n-1}, f_{n-1}, \ldots, a_0, f_0)\, p(f_n | q_n) \qquad (1)$$

with the normalizing constant

$$k_n = p(f_n, a_{n-1}, \ldots, a_0, f_0). \qquad (2)$$

The density $p(q_n | a_{n-1}, f_{n-1}, \ldots, a_0, f_0)$ can be written as

$$p(q_n | a_{n-1}, f_{n-1}, \ldots, a_0, f_0) = \int_{q_{n-1}} p(q_n | q_{n-1}, a_{n-1})\, p(q_{n-1} | f_{n-1}, \ldots, a_0, f_0)\, dq_{n-1} \qquad (3)$$

with the Markov assumption $p(q_n | q_{n-1}, a_{n-1}, \ldots, q_0, a_0) = p(q_n | q_{n-1}, a_{n-1})$ for
the state transition. This probability depends only on the camera movement $a_{n-1}$. The
inaccuracy of the camera movement is modeled with a normally distributed noise
component, so that the state transition probability can be written as
$p(q_n | q_{n-1}, a_{n-1}) = \mathcal{N}(q_{n-1} + a_{n-1}, \Sigma)$ with the covariance matrix
$\Sigma$ of the inaccuracy of the camera movement. If one deals with discrete states $q_n$,
the integral in equation (3) simply becomes a sum

$$p(q_n | a_{n-1}, f_{n-1}, \ldots, a_0, f_0) = \sum_{q_{n-1}} p(q_n | q_{n-1}, a_{n-1})\, p(q_{n-1} | f_{n-1}, \ldots, a_0, f_0) \qquad (4)$$



that can easily be evaluated analytically. For example, to classify an object
in a sequence of images with $q_n = (\Omega_\kappa)$, $p(q_n | q_{n-1}, a_{n-1})$ degenerates to

$$p(q_n | q_{n-1}, a_{n-1}) = \begin{cases} 1 & \text{if } q_n = q_{n-1} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

since the object class does not change if the camera is moved, and consequently
equation (4) must have an analytical solution.
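
For this purely discrete case the recursion can be sketched in a few lines. The sketch assumes that the classifier supplies, for every image, a vector of likelihoods with one entry per class; the function name `fuse_class_posterior` is hypothetical.

```python
import numpy as np

def fuse_class_posterior(prior, likelihoods):
    """Recursive Bayes update for a discrete, static class state."""
    posterior = np.asarray(prior, dtype=float)
    for likelihood in likelihoods:
        # Prediction, eq. (4): the identity transition of eq. (5) leaves it unchanged.
        predicted = posterior
        # Update, eq. (1): weight by p(f_n | q_n) and normalise.
        posterior = predicted * np.asarray(likelihood, dtype=float)
        posterior /= posterior.sum()
    return posterior
```

With a uniform prior over two classes and two images that each favour the first class with likelihoods (0.6, 0.4), the fused posterior is (9/13, 4/13), so the evidence accumulates over the sequence.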

But we want to use the fusion of multiple views for our viewpoint selection approach
tm where we have to deal with localization of objects in continuous pose spaces and 
consequently states $q_n$ with continuous pose parameters. For that reason it is no longer
possible to simplify equation (3) to equation (4).

The classic approach for solving this recursive density propagation is the well-known 
Kalman Filter ©. But in computer vision the necessary assumptions for the Kalman 
Filter, e.g. $p(f_n | q_n)$ being normally distributed, are often not valid. In real world
applications this density $p(f_n | q_n)$ is usually not normally distributed due to object
ambiguities, sensor noise, occlusion, etc. This is a problem since it leads to a
distribution which is not analytically computable. One approach to handling such
multimodal densities is given by the so-called particle filters. The basic idea is to
approximate the a posteriori density by a set of weighted particles. In our approach we
use the Condensation algorithm (CONditional DENSity propagATION) [3]. It uses a sample
set $C^n = \{c_1^n, \ldots, c_K^n\}$ to approximate the multimodal probability distribution in
equation (1). Please note that we do not only have a continuous state space for $q_n$ but
a mixed discrete/continuous state space for object class and pose, as mentioned at the
beginning of this section. The practical procedure of applying the Condensation
algorithm to the fusion problem is illustrated in the next section.



2.2 Condensation Algorithm for Fusion of Multiple Views 



In this section we show how to use the Condensation algorithm for the fusion
of multiple views.

As we want to classify and localize objects, we need to include the class and pose of
the object in our state $q_n$. In our experimental setup we move our camera on a
hemisphere around the object (see Fig. 2). Consequently, the pose of the object is
modeled as the viewing position on a hemisphere (azimuthal and colatitude angles). This
leads to the following definitions of the state $q_n = (\Omega_\kappa, \alpha^n, \beta^n)^T$ and the
samples $c_i^n = (\Omega_\kappa, \alpha_i^n, \beta_i^n)^T$ with the class $\Omega_\kappa$, the azimuthal angle
$\alpha \in [0^\circ; 360^\circ)$ and the colatitude angle $\beta \in [0^\circ; 90^\circ]$. In Fig. 2 the pose
space is illustrated. The camera movements are defined accordingly as
$a_n = (\Delta\alpha_n, \Delta\beta_n)^T$ with $\Delta\alpha_n$ and $\Delta\beta_n$ denoting the relative azimuthal
and colatitude change of the viewing position of the camera.

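Fig. 2. Experimental setup and the possible pose space

In the practical realization of the Condensation, one starts with an initial sample
set $C^0 = \{c_1^0, \ldots, c_K^0\}$ with samples distributed uniformly over the state space. For the
generation of a new sample set $C^n$, samples $c_i^n$ are

1. drawn from $C^{n-1}$ with probability

$$\frac{p(f_{n-1} | c_i^{n-1})}{\sum_{j=1}^{K} p(f_{n-1} | c_j^{n-1})} \qquad (6)$$

2. propagated with the sample transition model

$$c_i^n = c_i^{n-1} + (0, r_\alpha, r_\beta)^T \quad \text{with} \quad r_\alpha \sim \mathcal{N}(\Delta\alpha_n, \sigma_\alpha), \quad r_\beta \sim \mathcal{N}(\Delta\beta_n, \sigma_\beta) \qquad (7)$$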

Table 1. Recognition rates for different sizes K of the sample set. The transition noise parameters
are set to $\sigma_\alpha$ = 1.8° and $\sigma_\beta$ = 1.5°. N denotes the number of fused images

            K = 43400
Object   N=1    N=2    N=5    N=10
o1       32%    16%    16%    16%
o2       48%    84%    92%    92%
o3       16%    36%    60%    60%
o4       24%    56%    64%    68%
o5       64%    88%    88%    92%
o6       24%    52%    76%    80%
o7       40%    80%    88%    88%
∅        35%    59%    69%    71%




and the variance parameters $\sigma_\alpha$ and $\sigma_\beta$ of the azimuthal and colatitude Gaussian
noise $\mathcal{N}(\Delta\alpha_n, \sigma_\alpha)$ and $\mathcal{N}(\Delta\beta_n, \sigma_\beta)$. They model the inaccuracy of the
camera movement under the assumption that the errors of the azimuthal and colatitude
movements of the camera are independent of each other.

3. evaluated in the image by $p(f_n | c_i^n)$.

For a detailed explanation on the theoretical background of the approximation of equa- 
tion CQl by the sample set cf. Q. 
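
Steps 1 to 3 above can be sketched as a single function. The sketch makes several assumptions that are not taken from the text: samples are stored as rows (class, azimuth, colatitude), the azimuth is wrapped to [0°, 360°), the colatitude is clipped to [0°, 90°], and `likelihood` is a hypothetical callable standing in for the classifier's evaluation of a sample in the new image.

```python
import numpy as np

def condensation_step(samples, weights, movement, sigma, likelihood, rng):
    """One fusion step on samples with columns (class, azimuth, colatitude)."""
    K = len(samples)
    # 1. Draw K samples with probability proportional to their previous weights.
    idx = rng.choice(K, size=K, p=weights / weights.sum())
    new = samples[idx].copy()
    # 2. Propagate the pose with the noisy camera movement; the class stays fixed.
    new[:, 1] = (new[:, 1] + rng.normal(movement[0], sigma[0], K)) % 360.0
    new[:, 2] = np.clip(new[:, 2] + rng.normal(movement[1], sigma[1], K), 0.0, 90.0)
    # 3. Evaluate each propagated sample in the new image.
    new_weights = np.array([likelihood(c) for c in new], dtype=float)
    return new, new_weights
```

Because the class is part of every sample, resampling in step 1 shifts samples between classes according to the accumulated evidence, which is what fuses the class decision over the sequence.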

It is important to note that it is absolutely necessary to include the class in the
object state (and therewith also in the samples $c_i^n$). An obvious idea that would
avoid this is to set up several sample sets, one for each object class, and to perform
the Condensation separately on each set. But this would not result in an integrated
classification/localization, but in separate localizations on each set under the assumption
of observing the corresponding object class. No fusion of the object class over the
sequence of images would be done in that case.



3 Experiments 

For the experiments presented in this section we have chosen an appearance-based
classifier using the Eigenspace approach in a statistical variation similar to Q . As already 
mentioned, the Condensation algorithm is independent of the classifier used, as long as
the classifier is able to evaluate $p(f_n | q_n)$. Our classifier projects an image $f_n$ into
the three-dimensional Eigenspace and evaluates the resulting feature vector for the trained
normal distribution whose pose parameters are closest to the given pose. The
three-dimensional Eigenspace was chosen to put the emphasis on the fusion aspect, as
such a low-dimensional Eigenspace is of course not suited to produce optimal feature
vectors.

Our data set consists of the seven toy manikins shown in Fig. 1. The objects have
been selected in such a way that they are strongly ambiguous from some viewpoints. The
objects o4 to o7 cannot even be classified with one view, so that a fusion of multiple
views is essential. The evaluation of our fusion approach was done with 25 sequences of

Fig. 3. [...] the transition noise parameters $\sigma_\alpha$ and $\sigma_\beta$. The size of the sample set is
K = 43400. N denotes the number of fused images

Fig. 4. [...] values of 95% (P95), 90% (P90), 75% (P75), 50% (P50). Size of sample set
K = 43400, transition noise parameters $\sigma_\alpha$ = 1.8°, $\sigma_\beta$ = 1.5°



10 images each per object. The camera movements a were chosen randomly from the
continuous space of possible movements (continuous within the mechanical limits of 0.03°).

In Table 1 we show the recognition rates for different sizes K of the sample set. As
expected, the quality of classification increases with the number N of fused images. It
also turns out that the size of the sample set has a noticeable influence on the recognition
rates, as the approximation of equation (1) is more accurate for larger sample sets.

Another important point we investigated was the influence of the noise parameters
$\sigma_\alpha$ and $\sigma_\beta$ from equation (7) on the recognition rate. In Fig. 3 the recognition
rates for different transition noise settings are shown. As can be seen, too much
transition noise (large $\sigma_\alpha$ and $\sigma_\beta$) performs better than insufficient transition
noise. The reason is that small $\sigma_\alpha$ and $\sigma_\beta$ cause the samples in the sample set to be
clustered in a very “narrow” area, with the consequence that errors in the camera movement
and localization are not sufficiently compensated. In contrast, too much noise spreads
the samples too far.

The results of the experiments on localization accuracy are shown in Fig. 4. The
accuracy is given by the so-called percentile values, which describe the limits of
the localization error if the classification is correct and only the X% best localizations
are taken into account. For example, the percentile value P90 expresses the largest
localization error within the 90% most accurate localizations. As can be seen in Fig. 4,
the P90 localization error drops from 50° in the first image down to 13° after ten images.
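
The percentile value as defined above can be sketched as follows. The sketch assumes that `errors` already contains only the localization errors of correctly classified trials; the function name and the rounding of the sample count (floor, with at least one sample) are assumptions.

```python
import math
import numpy as np

def percentile_error(errors, percent):
    """Largest error among the `percent` % most accurate localizations."""
    ordered = np.sort(np.asarray(errors, dtype=float))
    # Number of localizations that belong to the best `percent` %.
    count = max(1, math.floor(len(ordered) * percent / 100.0))
    return ordered[count - 1]
```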

The computation time needed for one fusion step is about 1.8 seconds on a Linux PC
(AMD Athlon 1 GHz) for the sample set with K = 43400 samples. As the computational
effort scales linearly with the size of the sample set, we are able to fuse 7 images per
second for the small sample set with K = 3500 samples, which already provides very
reasonable classification rates. We also want to note that the Condensation algorithm
can be parallelized very well, so that even real-time applications can be realized using
our approach.
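
The rate quoted for the small sample set follows from the stated timing under the linear scaling assumption; a short check, where $t_{3500}$ denotes the (estimated, not measured) time per fusion step for K = 3500:

```latex
\[
t_{3500} \approx 1.8\,\mathrm{s} \cdot \frac{3500}{43400} \approx 0.145\,\mathrm{s}
\quad\Longrightarrow\quad
\frac{1}{t_{3500}} \approx 6.9 \ \text{images per second}
\]
```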

4 Conclusion

In this paper we have presented a general approach for the fusion of multiple views for
active object recognition. Using the Condensation algorithm we are independent of
the chosen statistical classifier. Other advantages of our approach are its scalability
with the size of the sample set and the possibility to parallelize the Condensation
algorithm. In the experiments we have shown that our approach is well suited for the
fusion of multiple views, as we were able to double the overall classification rate from
35% to 71% and to increase the classification rate for single objects by up to 233%.

Presently we use randomly chosen views for our fusion. But we expect that far better 
classification rates will be reached after fewer views if we combine our fusion approach 
with our viewpoint selection ll3l'2ll . The combination of these two approaches for the 
selection of views and their fusion will result in a system that is still independent of the 
used classifier and well-suited for the given task of classifying ambiguous objects. 

Open questions in our approach are the minimal necessary size of the sample set
and the optimal parameters for the noise transition models. Furthermore, other sampling
techniques are to be evaluated.

Abstract. Logistic regression is presumably the most popular representative of
probabilistic discriminative classifiers. In this paper, a kernel variant of logistic
regression is introduced as an iteratively re-weighted least-squares algorithm in
kernel-induced feature spaces. This formulation allows us to apply highly efficient
approximation methods that are capable of dealing with large-scale problems.
For multi-class problems, a pairwise coupling procedure is proposed. Pairwise
coupling for “kernelized” logistic regression effectively overcomes conceptual
and numerical problems of standard multi-class kernel classifiers.



1 Introduction 



Classifiers can be partitioned into two main groups, namely informative and
discriminative ones. In the informative approach, the classes are described by modeling
their structure, i.e. their generative statistical model. Starting from these class models,
the posterior distribution of the labels is derived via the Bayes formula. The most popular
informative method is classical Linear Discriminant Analysis (LDA). However,
the informative approach has a clear disadvantage: modeling the classes is usually a
much harder problem than solving the classification problem directly.

In contrast to the informative approach, discriminative classifiers focus on modeling 
the decision boundaries or the class probabilities directly. No attempt is made to model 
the underlying class densities. In general, they are more robust than informative ones, since
fewer assumptions about the classes are made. The most popular discriminative method
is logistic regression (LOGREG), Hi- The aim of logistic regression is to produce an 
estimate of the posterior probability of membership in each of the c classes. Thus, be- 
sides predicting class labels, LOGREG additionally provides a probabilistic confidence 
measure about this labeling. This allows us to adapt to varying class priors. 

A different approach to discriminative classification is given by the Support Vector 
(SV) method. Within a maximum entropy framework, it can be viewed as the discrim- 
inative model that makes the least assumptions about the estimated model parameters. 
cf. H. Compared to LOGREG, however, the main drawback of the SVM is the absence
of probabilistic outputs.¹

In this paper, particular emphasis is put on a nonlinear “kernelized” variant of logistic 
regression. Compared to kernel variants of Discriminant Analysis (see dUi) and to the 
SVM, the kernelized LOGREG model combines the conceptual advantages of both 
methods: 

1. it is a discriminative method that overcomes the problem of estimating class condi- 
tional densities; 

2. it has a clear probabilistic interpretation that allows us to quantify a confidence level 
for class assignments. 

Concerning multi-class problems, the availability of probabilistic outputs allows
us to overcome another shortcoming of the SVM: in the usual SVM framework, a
multi-class problem with c classes is treated as a collection of c "one-against-all-others"
subproblems, together with some principle of combining the c outputs. This treatment
of multi-class problems, however, has two disadvantages, one conceptual and one
technical: (i) separating one class from all others may be an unnecessarily hard
problem; (ii) all c subproblems are stated as quadratic optimization problems over the full
learning set. For large-scale problems, this can impose unacceptably high computational
costs.

If, on the other hand, we are given posterior probabilities of class membership, we can 
apply a different multi-class strategy: instead of solving c one-against-all problems, we 
might solve c(c— 1) /2 pairwise classification problems, and try to couple the probabilities 
in a suitable way. Methods of this kind have been introduced in m and are referred to as 
pairwise coupling. Since kernelized LOGREG provides us with estimates of posterior 
probabilities, we can directly generalize it to multi-class problems by way of “plugging” 
the posterior estimates into the pairwise coupling procedure. 
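
The two ways of decomposing a multi-class problem can be enumerated with a small sketch. It only lists the binary subproblems and does not implement the coupling of the pairwise probabilities; both function names are hypothetical.

```python
from itertools import combinations

def pairwise_problems(classes):
    """All unordered class pairs, one binary problem per pair."""
    return list(combinations(classes, 2))

def one_against_all_problems(classes):
    """One binary problem per class: that class versus all remaining ones."""
    return [(k, [j for j in classes if j != k]) for k in classes]
```

For c = 4 classes this yields c(c-1)/2 = 6 pairwise problems, each involving only the samples of two classes, against 4 one-against-all problems that each involve the full learning set.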

The main focus of this paper is on pairwise coupling methods for kernel
classifiers.² For kernel-based algorithms in general, the computational efficiency is mostly
determined by the number of training samples. Thus, pairwise coupling schemes also
overcome the numerical problems of the one-against-all strategy: it is much easier to
solve c(c-1)/2 small problems than to solve c large problems. This leads to a reduction
of computational costs that scales linearly in the number of classes. We conclude this
paper with performance studies for large-scale problems. These experiments effectively
demonstrate that kernelized LOGREG attains a level of accuracy comparable to the
SVM, while additionally providing the user with posterior estimates for class
membership. Moreover, concerning the computational costs for solving multi-class
problems, it outperforms one of the best SVM optimization packages available.



¹ Some strategies for approximating SVM posterior estimates in a post-processing step have been
reported in the literature, see e.g. [?]. In this paper, however, we restrict our attention to fully
probabilistic models.

² A recent overview of kernel methods can be found in [?].
2 Kernelized Logistic Regression



The problem of classification formally consists of assigning observed vectors $x \in \mathbb{R}^d$
into one of c classes. A classifier is a mapping that assigns labels to observations. In
practice, a classifier is trained on a set of observed i.i.d. data-label pairs $\{(x_i, y_i)\}_{i=1}^{N}$,
drawn from the unknown joint density $p(x, y) = p(y \mid x)\,p(x)$. For convenience in what
follows, we define a discriminant function for class k as

$$g_k(x) = \log \frac{P(y = k \mid x)}{P(y = c \mid x)}, \qquad k = 1, \ldots, c-1. \tag{1}$$



Assuming a linear discriminant function leads us to the logistic regression (LOGREG)
model: $g_k(x) = w_k^T x$.³

Considering a two-class problem with labels {0,1}, it is sufficient to represent
$P(1 \mid x)$, since $P(0 \mid x) = 1 - P(1 \mid x)$. Thus, we can write the “success probability” in
the form

$$\pi_w(x) := P(1 \mid x) = \frac{1}{1 + \exp\{-w^T x\}}. \tag{2}$$



For discriminative classifiers like LOGREG, the model parameters are chosen by max- 
imizing the conditional log-likelihood: 



$$l(w) = \sum_{i=1}^{N} \left[ y_i \log \pi_w(x_i) + (1 - y_i) \log\big(1 - \pi_w(x_i)\big) \right]. \tag{3}$$

In order to find the optimizing weight vector w, we wish to solve the equation system
$\nabla_w l(w) = 0$. Since the $\pi_i := \pi_w(x_i)$ depend nonlinearly on w, however, this system
cannot be solved analytically and iterative techniques must be applied. The Fisher scoring
method updates the parameter estimates w at the r-th step by

$$w_{r+1} = w_r - \{E[H]\}^{-1} \nabla_w l(w), \tag{4}$$

with H being the Hessian of l.⁴ The scoring equation (4) can be restated as an Iterated
Re-weighted Least Squares (IRLS) problem, cf. [?]. Denoting with X the design matrix
(the rows are the input vectors), the Hessian is equal to $(-X^T W X)$, where W is a
diagonal matrix:

$$W = \operatorname{diag}\{\pi_1(1 - \pi_1), \ldots, \pi_N(1 - \pi_N)\}.$$

The gradient of l (the scores) can be written as $\nabla_w l = X^T W e$, where e is a vector
with entries $e_j = (y_j - \pi_j)/W_{jj}$. Forming a variable

$$q_r = X w_r + e,$$

³ Throughout this paper we have dropped the constant b in the more general form $g(x) =
w^T x + b$. We assume that the data vectors are augmented by an additional entry of one.

⁴ For LOGREG, the Hessian coincides with its expectation: H = E[H]. For further details on
Fisher’s method of scoring the reader is referred to [9].
the scoring updates read

$$(X^T W_r X)\, w_{r+1} = X^T W_r q_r. \tag{5}$$

These are the normal form equations of a least squares problem with input matrix $W_r^{1/2} X$
and dependent variables $W_r^{1/2} q_r$. The values $W_{ii}$ are functions of the actual $w_r$, so that
(5) must be iterated.

Direct optimization of the likelihood (3), however, often leads to severe overfitting
problems, and a preference for smooth functions is usually encoded by introducing priors
over the weights w. In a regularization context, such prior information can be interpreted
as adding some bias to maximum likelihood parameter estimates in order to reduce the
estimator’s variance. The common choice of a spherical Gaussian prior distribution with
covariance $\propto \lambda^{-1} I$ leads to a ridge regression model, cf. [?]. The regularized update
equations read

$$(X^T W_r X + \lambda I)\, w_{r+1} = X^T W_r q_r. \tag{6}$$

The above equation states LOGREG as a regularized IRLS problem. This allows us to
extend the linear model to nonlinear kernel variants: each stage of iteration reduces to
solving a system of linear equations, for which it is known that the optimizing weight
vector can be expanded in terms of the input vectors, cf. [?]:

$$w = \sum_{i=1}^{N} x_i \alpha_i = X^T \alpha. \tag{7}$$

Substituting this expansion of w into the update equation (6) and introducing the dot
product matrix $(K)_{ij} = (x_i \cdot x_j)$, $K = X X^T$, we can write

$$(K W_r K + \lambda K)\, \alpha_{r+1} = K W_r q'_r, \tag{8}$$

with $q'_r = K \alpha_r + e$. Equation (8) can be simplified to

$$(K + \lambda W_r^{-1})\, \alpha_{r+1} = q'_r. \tag{9}$$

With the usual kernel trick, the dot products can be substituted by kernel functions 
satisfying Mercer’s condition. This leads us to a nonlinear generalization of LOGREG 
in kernel feature spaces which we call kLOGREG. 
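The kernelized update (9) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the function names, the Gaussian RBF kernel, the values of `lam` and `gamma`, and the clipping constant `eps` are all hypothetical choices, and the linear system is solved with a textbook conjugate gradient loop.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel (an assumed choice; any Mercer kernel could be used)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def conjugate_gradient(A, b, max_iter=200, tol=1e-20):
    """Solve A x = b for a symmetric positive definite A (list of rows)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for the start value x = 0
    p = list(r)  # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def fit_klogreg(X, y, lam=0.1, gamma=1.0, n_iter=20, eps=1e-9):
    """Iterate (K + lam * W^-1) alpha_new = K alpha + e, i.e. update (9)."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(n_iter):
        f = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        pi = [min(max(sigmoid(fi), eps), 1.0 - eps) for fi in f]
        w = [p_i * (1.0 - p_i) for p_i in pi]                 # diagonal of W
        q = [f[i] + (y[i] - pi[i]) / w[i] for i in range(n)]  # q' = K alpha + e
        A = [[K[i][j] + (lam / w[i] if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        alpha = conjugate_gradient(A, q)
    return alpha


def predict_proba(x, X, alpha, gamma=1.0):
    """Posterior estimate P(1 | x) from the kernel expansion of w."""
    return sigmoid(sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X)))
```

For instance, calling `fit_klogreg` on the four XOR points `[(0, 0), (0, 1), (1, 0), (1, 1)]` with labels `[0, 1, 1, 0]` yields coefficients for which `predict_proba` lies above 0.5 on the two points labelled 1 and below 0.5 on the other two. At convergence the coefficients satisfy $\lambda \alpha_i = y_i - \pi_i$, which follows from (9) with $q' = K\alpha + e$.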

The matrix $(K + \lambda W^{-1})$ is symmetric, and the optimizing vector $\alpha_{r+1}$ can be computed
in a highly efficient way by applying approximative conjugate gradient inversion
techniques, cf. [12], p. 83. The availability of efficient approximation techniques
from the well-studied field of numerical linear algebra constitutes the main advantage
over a related approach to kLOGREG, presented in [11]. The latter algorithm computes
the optimal coefficients by a sequential approach. The problem with this on-line
algorithm, however, is the following: for each new observation $x_t$, t = 1, 2, ..., it imposes
computational costs of the order $O(t^2)$. Given a training set of N observations in
total, this accumulates to an $O(N^3)$ process, for which to our knowledge no efficient
approximation methods are known.
3 Pairwise Coupling for Multi-class Problems

Typically two-class problems tend to be much easier to learn than multi-class problems. 
While for two-class problems only one decision boundary must be inferred, the general 
c-class setting requires us to apply a strategy for coupling decision rules. In the standard 
approach to this problem, c two-class classifiers are trained in order to separate each of the
classes against all others. These decision rules are then coupled either in a probabilistic 
way (e.g. for LDA) or by some heuristic procedure (e.g. for the SVM). 

A different approach to the multi-class problem was proposed in [2]. The central idea
is to learn c(c − 1)/2 pairwise decision rules and to couple the pairwise class probability
estimates into a joint probability estimate for all c classes. It is obvious that this strategy
is only applicable for pairwise classifiers with probabilistic outputs.⁵ From a theoretical
viewpoint, pairwise coupling bears some advantages: (i) jointly optimizing over all c
classes may impose unnecessary problems, whereas pairwise separation might be much simpler;
(ii) we can select a highly specific model for each of the pairwise subproblems.

Concerning kernel classifiers in particular, pairwise coupling is also attractive for
practical reasons. For kernel methods, the computational costs are dominated by the size of
the training set, N. For example, conjugate gradient approximations for kLOGREG scale
as $O(N^2 \cdot m)$, with m denoting the number of conjugate-gradient iterations. Keeping
m fixed leads us to an $O(N^2)$ dependency as a lower bound on the real costs.⁶ Let us
now consider c classes, each of which contains $N_c$ training samples. For a one-against-all
strategy, we have costs scaling as $O(c\,(c N_c)^2) = O(c^3 N_c^2)$. For the pairwise
approach, this reduces to $O(\tfrac{1}{2} c(c-1)(2 N_c)^2) = O(2(c^2 - c) N_c^2)$. Thus, we
have a reduction of computational costs inversely proportional to the number of classes.
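The two cost expressions can be compared numerically. The following is a minimal sketch that only evaluates the scaling terms of this paragraph with constants dropped; the function names are illustrative, and the values of c and $N_c$ are those listed for the chairs data in Table 1.

```python
def one_against_all_cost(c, n_c):
    """c subproblems, each over the full learning set of c * n_c samples."""
    return c * (c * n_c) ** 2


def pairwise_cost(c, n_c):
    """c(c-1)/2 subproblems, each over the 2 * n_c samples of one class pair."""
    return (c * (c - 1) // 2) * (2 * n_c) ** 2


c, n_c = 25, 89  # number of classes and samples per class (chairs data)
ratio = one_against_all_cost(c, n_c) / pairwise_cost(c, n_c)
print(round(ratio, 2))  # 13.02
```

The ratio equals $c^2 / (2(c-1))$, so it grows roughly like c/2 and does not depend on $N_c$.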
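Pairwise coupling can be formalized as follows: considering a set of events $\{A_i\}_{i=1}^{c}$,
suppose we are given probabilities $r_{ij} = \mathrm{Prob}(A_i \mid A_i \text{ or } A_j)$. Our goal is to couple the
$r_{ij}$’s into a set of probabilities $p_i = \mathrm{Prob}(A_i)$. This problem has no general solution,
but in [2] the following approximation is suggested: introducing a new set of auxiliary
variables $\mu_{ij} = p_i / (p_i + p_j)$, we wish to find $\hat{p}_i$’s such that the corresponding $\hat{\mu}_{ij}$’s are in some
sense “close” to the observed $r_{ij}$’s. A suitable closeness measure is the Kullback-Leibler
divergence between $r_{ij}$ and $\hat{\mu}_{ij}$:

$$D_{KL} = \sum_{i<j} \left[ r_{ij} \log \frac{r_{ij}}{\hat{\mu}_{ij}} + (1 - r_{ij}) \log \frac{1 - r_{ij}}{1 - \hat{\mu}_{ij}} \right]. \tag{10}$$

The associated score equations read

$$\sum_{j \neq i} \hat{\mu}_{ij} = \sum_{j \neq i} r_{ij}, \quad i = 1, \ldots, c, \quad \text{subject to } \sum_{i} \hat{p}_i = 1. \tag{11}$$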

⁵ In a former version of [2], available as a Tech. Rep. at the University of Toronto, it has been
suggested to apply “approximative” pairwise coupling to the SVM. However, we feel that this
approach is not very promising since it lacks a clear probabilistic interpretation.

⁶ For the SVM, the situation is more difficult and heavily depends on implementation details. As
an example, the popular LOQO package [14] even has $O(N^3)$ complexity, due to a Cholesky
decomposition of an N × N matrix. Subset methods are usually much more efficient, but their
performance is problem-dependent, and thus difficult to analyze.
Starting with an initial guess for the $\hat{p}_i$ and corresponding $\hat{\mu}_{ij}$, we can compute the $\hat{p}_i$’s
that minimize (10) by iterating

- $\hat{p}_i \leftarrow \hat{p}_i \cdot \big( \sum_{j \neq i} r_{ij} \big) / \big( \sum_{j \neq i} \hat{\mu}_{ij} \big)$

- renormalize the $\hat{p}_i$’s and recompute the $\hat{\mu}_{ij}$.

Suppose we have successfully trained all pairwise kLOGREG classifiers. Then, we can
predict the class membership of a new observation $x_*$ as follows:

1. Evaluate the c(c − 1)/2 classification rules to obtain
   $r_{ij}(x_*) = \mathrm{Prob}(x_* \in \text{class } i \mid x_* \in \text{class } i \text{ or } x_* \in \text{class } j)$,
   and initialize $\hat{\mu}_{ij} = r_{ij}$.

2. Starting with an initial guess for the $\hat{p}_i$’s, run the above iterations.

3. We finally obtain the posterior probabilities for class membership of pattern $x_*$.
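The coupling iteration can be sketched as follows. This is a minimal illustration, assuming that all pairwise estimates lie strictly between 0 and 1 and that the initial guess for the class probabilities is uniform; the function name and the stopping parameters `max_sweeps` and `tol` are hypothetical, and the auxiliary values are recomputed from the current probabilities in every step.

```python
def couple_pairwise(r, max_sweeps=1000, tol=1e-12):
    """Couple pairwise estimates r[i][j] ~ Prob(A_i | A_i or A_j) into p_i.

    r is a c x c nested list with r[j][i] = 1 - r[i][j]; the diagonal is ignored.
    Assumes 0 < r[i][j] < 1 for all i != j.
    """
    c = len(r)
    p = [1.0 / c] * c  # initial guess: uniform class probabilities
    for _ in range(max_sweeps):
        p_old = list(p)
        for i in range(c):
            others = [j for j in range(c) if j != i]
            sum_r = sum(r[i][j] for j in others)
            # auxiliary variables mu_ij = p_i / (p_i + p_j) for the current p
            sum_mu = sum(p[i] / (p[i] + p[j]) for j in others)
            p[i] *= sum_r / sum_mu
            total = sum(p)
            p = [v / total for v in p]  # renormalize after each update
        if max(abs(a - b) for a, b in zip(p, p_old)) < tol:
            break
    return p
```

If the pairwise estimates are exactly consistent with some probability vector, for example $r_{ij} = p_i/(p_i + p_j)$ built from p = (0.5, 0.3, 0.2), the iteration recovers that vector. With estimates from real classifiers the $r_{ij}$ are generally not consistent, and the result is the approximation described by (10) and (11).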



4 Experiments 

Here we present results for the “MPI chairs” and “Isolet” datasets.⁷ In both cases the
number of classes is relatively high: the chair dataset consists of downscaled images
from 25 different classes of chairs; the Isolet dataset contains spoken names of all 26
letters of the alphabet. We compared both the prediction accuracy and the computational
costs of a “state-of-the-art” SVM package⁸ and kLOGREG. The results are summarized
in Table 1. We conclude that pairwise coupled kLOGREG attains a level of prediction
accuracy comparable to the SVM, while imposing significantly lower computational
costs. Concerning the training times, the reader should notice that we are comparing
the highly tuned SVMTorch optimization package with our straightforward kLOGREG
implementation, which we consider to still offer ample opportunities for further
optimization.



Table 1. Test error rates (e) and computation times (t) on the MPI chair and the Isolet database.
c = number of classes, N_c = number of samples per class, Idim = input dimension, Nt = size of
test set. The training times (t) are measured on a 500 MHz PC.

Dataset                  c    N_c   Idim   Nt     SVM                     kLOGREG
Chairs: images           25   89    256    2500   e = 1.48%, t = 152 s    e = 1.52%, t = 53 s
Chairs: images + edges   25   89    1280   2500   e = 0.80%, t = 19 min   e = 0.84%, t = 3:05 min
Isolet                   26   240   617    1560   e = 3.0%,  t = 21 min   e = 3.0%,  t = 18 min

⁷ Available via ftp://ftp.mpik-tueb.mpg.de/pub/ chair .dataset and
http://www.ics.uci.edu/~mlearn/MLRepository.html, respectively.

⁸ We used the SVMTorch II V1.07 implementation, see [?].
5 Conclusion

In this paper we have presented a new approach to multi-class classification with kernel
methods. In particular, we have focused on a kernelized variant of classical logistic
regression, which we name kLOGREG. The kLOGREG classifier combines the
advantages of related versions of kernel methods: it is a discriminative classifier that
overcomes the problem of estimating class models, and it has a clear probabilistic interpretation.
We have stated kLOGREG as an iteratively re-weighted least-squares problem
in kernel feature spaces. The real payoff of this algorithmic formulation is the applicabil- 
ity of highly efficient approximation techniques from the well-studied field of numerical 
linear algebra. 

Concerning multi-class problems, we can use the kLOGREG classifier as a build- 
ing block in a pairwise coupling procedure. The main idea of pairwise coupling is to 
couple all pairwise decision rules into an estimate for the posterior probability of class 
membership. Besides conceptual advantages over classical ways of handling multi-class
problems, this technique additionally has a clear numerical advantage: for a fixed
number of training patterns, the computational costs reduce linearly in the number of 
classes. 

Experiments for large-scale problems with many classes have effectively demon- 
strated that kLOGREG attains a level of accuracy comparable to the SVM. Moreover, 
concerning the computational costs, our straightforward implementation, which basically
uses routines from Numerical Recipes [12], outperformed one of the best SVM
packages available. We thus conclude that kLOGREG is a well-suited method for
dealing with multi-class problems that require us to quantify the uncertainty about the 
predicted class labels. 

Acknowledgments. The author wishes to thank J. Buhmann, M. Braun and L. Hermes 
for fruitful discussions. Thanks for financial support go to the German Research Council
(DFG).
Abstract. In this paper we present a new approach for the localization and
classification of 2-D objects that are situated in heterogeneous background 
or are partially occluded. We use an appearance-based approach and model 
the local features derived from wavelet multiresolution analysis by statistical 
density functions. In addition to the object model we define a new model for 
the background and a function that assigns the single feature vectors either to 
the object or to the background. Here, the background is modelled as uniform 
distribution, therefore we need for all possible backgrounds only one density 
function. Experimental results show that this model is well suited for this 
recognition task. 

Keywords: object recognition, statistical modelling, background model 



1 Introduction 

There are several approaches for object recognition. Since the approaches that use the 
results of a segmentation process like lines or vertices suffer from segmentation errors 
and the loss of information, we use an appearance-based approach; i. e. the image data, 
e. g. the pixel intensities, are used directly without a previous segmentation process. 
One appearance-based approach is the “eigenspace” introduced by [?], which uses principal
component analysis and encodes the data in so-called “eigen-images”. Other authors apply
histograms; the best-known approach is the “multidimensional receptive field
histograms” of [?], which contain the results of local filtering, e. g. by Gaussian derivative
filters. [?] use a statistical approach for the recognition and model the features by Gaussian
mixtures. We use and extend the statistical model of [5]: local feature vectors are
calculated by the coefficients of the multiresolution analysis using Johnston wavelets.
The object features are modelled statistically as normal distributions by parametric den- 
sity functions. As we will show in the experiments this approach is very insensitive to 
changes in the illumination. 

For the recognition in real environments very often the objects reside not in the 
homogeneous background of the training, but in cluttered background and some parts of 
the objects are occluded. For this purpose [?], who use the eigenspace approach, try to find

* This work was funded by the German Science Foundation (DFG) Graduate Research Center
3-D Image Analysis and Synthesis. Only the authors are responsible for the content. 
Fig. 1. Left: Image covered by a grid; the object is enclosed by a tight boundary (black line). The
respective old rectangular box is plotted in gray; because of its form it encloses not only object
features but also background features. Right: For movements of the object inside the image plane
the object grid is transformed with the same internal rotation $\phi_{int}$ and internal translation $t_{int}$ as
the object.



n object features in the image that are neither affected by the background nor occluded. 
For this reason they generate pose hypotheses and select the n best fitting features. In 
contrast other authors explicitly model the background and assign the features to the 
object or to the background. For example, [?], who use vertices as features, assume a
uniform distribution for the position of the background features and model a probabilistic
weighted assignment. In [?], in contrast, the features are assigned neither to the object nor
to the background. Since medical radiographs were classified, the background model
has a constant gray value of zero. For our approach, too, a background model was
presented in [6] and [?]: the known background was trained as a Gaussian distribution
and a weighted assignment was applied. We propose a new model for this approach:
the background is modelled as a uniform distribution over the possible values of the
feature vectors, so that we are independent of the current background. We define an
to the background, depending whether the calculated value of the object density or of the 
background density is higher for this local feature vector. The recognition is performed 
hierarchically by a maximum likelihood estimation, whereby an accelerated algorithm 
is used for the global search. 

In the following section we briefly outline the object model for homogeneous background,
and in section 3 we describe our new model for heterogeneous background and
occlusions. In section 4 the experiments and the results are presented. Finally we end 
with a summary of the results and the future work in section 5. 

2 Object Model for Homogeneous Background

A grid with the sampling resolution $r_s$, whereby s is the index for the scale, is laid over
the square image f as one can see in Figure 1. These grid locations will be summarized
in the following as $X_s = \{x_{m,s}\}_{m=0,\ldots,M-1}$, $x_{m,s} \in \mathbb{R}^2$. On each grid point a local
feature vector $c_s(x_{m,s})$ with two components is calculated by the coefficients of the
multiresolution analysis using discrete Johnston 8-TAP wavelets and is interpreted as a
random variable for the statistical model. Sources of this randomness are, among others, the
noise of the image sampling process and changes in the lighting conditions. To simplify
the notation, the index s is omitted in the following.

For the object model a close boundary is laid around the object. In [?] only a rectangular
box was implemented and it was positioned manually. In contrast, the form
of the boundary is now arbitrary, so that it encloses the object much better (see left image of
Figure 1). Moreover, it is calculated automatically during the training, for which only one
image of the object in front of a dark background is necessary. During the recognition
this trained form of the bounded region A is used. We assume that the feature vectors
$c_{A,m}$ inside the bounded region $A \subset X$ belong to the object, and are statistically independent
of the feature vectors $c_{X \setminus A,k}$ outside the bounded region, $X \setminus A$. Therefore,
for the object model we only need to consider the (object) feature vectors $c_{A,m}$, written
in concatenated form as the vector $c_A$. Now, the object can be described by the density function
$p(c_A \mid B, \phi, t)$ depending on the learned model parameter set B, the rotation $\phi = \phi_{int}$
and the translation $t = t_{int}$.

Further we assume that the features are normally distributed. In [5] a statistical
dependency between adjacent feature vectors in a row was modelled, but for arbitrary
objects the results are worse than for statistical independence. Therefore we assume that
the single feature vectors are statistically independent of each other. So the density
function can be written as

$$p(c_A \mid B, \phi, t) = \prod_{x_m \in A} p(c_{A,m} \mid \mu_m, \Sigma_m, \phi, t), \tag{1}$$

whereby $\mu_m$ is the mean vector and $\Sigma_m$ the covariance matrix of the m-th feature
vector. Because of the statistical independence, $\Sigma_m$ is a diagonal matrix.

For the localization we perform a maximum likelihood estimation over all possible
rotations $\phi = \phi_{int}$ and translations $t = t_{int}$,

$$(\hat{\phi}, \hat{t}) = \operatorname*{argmax}_{(\phi, t)} p(c_A \mid B, \phi, t), \tag{2}$$

and for the classification an additional maximum likelihood estimation over all possible
classes k:

$$(\hat{k}, \hat{\phi}, \hat{t}) = \operatorname*{argmax}_{k} \operatorname*{argmax}_{(\phi, t)} p(c_A \mid B_k, \phi, t). \tag{3}$$

To accelerate the localization, a rough localization is first conducted at a coarse
resolution, followed by a refinement at a finer resolution. For further details see [?].



3 Model for Heterogeneous Background and Occlusions

For occlusions the assumption that all the feature vectors inside the region A belong to
the object is wrong. Moreover, for heterogeneous background the feature vectors at the
border of the object, which cover not only the object but also part of the background, are
modified. Therefore [?] try to find n of the total N object features that are not affected
by the heterogeneous background and the occlusion. But this approach carries the
risk of confusing similar-looking objects, like the two matchboxes in Figure 3.
For this purpose, we consider all local feature vectors $c_{A,m}$ in the bounded region A and
define a background model and an assignment function $\zeta \in \{0, 1\}^N$, whereby the m-th
component $\zeta_m$ of $\zeta$ assigns the local feature vector $c_{A,m}$ to the object ($\zeta_m = 1$) or to
the background ($\zeta_m = 0$). So the density function

$$p(c_A \mid B, \phi, t) = \sum_{\zeta} p(c_A, \zeta \mid B, \phi, t) \tag{4}$$

also includes the assignment function $\zeta$ and becomes a mixture density. Now B
includes the learned parameters $B_1$ of the object as well as the learned parameters $B_0$ of
the background. In [6] and [?] the background had to be known already during the
training and was trained as a normal distribution, i. e. for each different background a
separate background density had to be trained. In addition, statistical dependencies between
the feature vectors in a row were modelled. For some experiments a row dependency was also
modelled for the assignment function [?], whereas for other experiments statistical
independence of the assignments was assumed [6]. Thereby weighted assignments were
modelled, i. e. a local feature vector $c_{A,m}$ belongs with a probability $p(\zeta_m = 1)$ to the
object and $p(\zeta_m = 0)$ to the background. This leads to a very complex model.

To be independent of the current background and handle every possible background
by only one background density, we model the background as a uniform distribution
over the possible values of the feature vectors. So nothing has to be known
about the background a priori; the background density depends only on the chosen feature
set, e. g. the wavelet type used for filtering. Additionally, it is identical for all feature vectors
and therefore independent of the transformations $t_{int}$ and $\phi_{int}$. As in section 2
we assume statistical independence of the features and also of the assignments, so the
density function in (4) can be transformed to

$$p(c_A \mid B, \phi, t) = \prod_{x_m \in A} \sum_{\zeta_m} p(c_{A,m}, \zeta_m \mid B, \phi, t) = \prod_{x_m \in A} \sum_{\zeta_m} p(\zeta_m)\, p(c_{A,m} \mid \zeta_m, B, \phi, t). \tag{5}$$

This is a much simpler expression than (4): now we no longer have a marginalization
over all possible assignments $\zeta$ for all features $c_A$, but for each single feature vector
$c_{A,m}$ a marginalization over the single assignments $\zeta_m$.

The assignment $\zeta_m$ is hidden information. For example, we do not know how much
of an object is occluded. So we set the a priori probabilities that a local feature vector $c_{A,m}$
belongs to the object or to the background as equal. Further we model $\zeta_m$ as a (0,1)-decision,
i. e. a feature vector $c_{A,m}$ belongs either to the object or to the background.
The decision is taken during the localization process. Thereby $\zeta$ is chosen so that the
density value $p(c_A \mid B, \phi, t)$ is maximized. That is, the density value for each feature
vector $c_{A,m}$ (see (5)),

$$\sum_{\zeta_m} p(\zeta_m)\, p(c_{A,m} \mid \zeta_m, B, \phi, t), \tag{6}$$

has to be maximized. This can be done by the assignment

$$\hat{\zeta}_m = \operatorname*{argmax}_{\zeta_m} \big( p(c_{A,m} \mid \zeta_m = 1, B_1, \phi, t),\; p(c_{A,m} \mid \zeta_m = 0, B_0) \big) \tag{7}$$

and setting the probability $p(\zeta_m)$ for the respective assignment to one, the probability
for the other assignment to zero.

Fig. 2. Left: for each possible internal transformation all the feature vectors have to be interpolated.
Right: for the same rotation and only another translation, most of the feature vectors can be reused.

For example, for the Johnston wavelets used in the experiments of section 4 the object
density p{cA,m\Cm = I, t) for a not occluded feature vector CA^m lays typically 

between e~^ and e^. For occlusion it becomes very low: between and In 

this case the feature vector CA,m is assigned to the background and the background 
density p{cA,m\Cm = 0, Bq) = ® (that is the value for the used Johnston wavelets) 

is chosen for this feature vector CA,m- 

For the localization and classification a maximum likelihood estimation is performed
as described in (2) and (3). We also tested a heuristic measure for the
classification in section 4. The single objects differ in their size, i. e. also in the number N of their
local object feature vectors $c_{A,m}$ inside the bounded region A. Since non-fitting object
features are assigned to the background and the background density value is used, there
are some misclassifications caused by the different sizes of the objects. Therefore we
normalize the density before the maximum likelihood estimation by the number N of
local object feature vectors $c_{A,m}$:

$$(\hat{k}, \hat{\phi}, \hat{t}) = \operatorname*{argmax}_{k} \operatorname*{argmax}_{(\phi, t)} \sqrt[N]{p(c_A \mid B_k, \phi, t)}. \tag{8}$$

Because of the statistical independence (see (5)) the expression in (8) is the geometric
mean of the density values $p(c_{A,m} \mid \zeta_m, B_k, \phi, t)$ of all feature vectors $c_{A,m}$ with the
respective assignments $\zeta_m$.
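The hard assignment and the normalized score can be sketched as follows for a single pose hypothesis. This is only an illustration under stated assumptions: the function names are hypothetical, the object densities are taken as normal distributions with diagonal covariance as in section 2, and the constant background density `log_bg_density` is a free parameter here, since its actual value depends on the feature set. Working with log densities turns the N-th root in (8) into a division by N.

```python
import math


def log_gauss_diag(c, mean, var):
    """Log density of a normal distribution with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(c, mean, var)
    )


def assign_and_score(features, means, variances, log_bg_density):
    """Hard-assign each feature vector and return (zeta, normalized log-likelihood).

    zeta[m] = 1 assigns feature m to the object, 0 to the background; the score
    is the log of the geometric mean of the selected density values.
    """
    zeta, total = [], 0.0
    for c, mean, var in zip(features, means, variances):
        log_obj = log_gauss_diag(c, mean, var)
        is_object = log_obj >= log_bg_density
        zeta.append(1 if is_object else 0)
        total += log_obj if is_object else log_bg_density
    return zeta, total / len(features)
```

In a full recognizer this score would be evaluated for every class and every candidate rotation and translation, and the maximum taken; that outer search is omitted here.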

Further we speed up the first step of the localization process that starts with a global
search over all possible internal rotations $\phi_{int}$ and translations $t_{int}$ on a rough resolution.



Appearance-Based Statistical Object Recognition 259 




Fig. 3. The 5 objects in the different environments: box on the training background, matchbox 1 
on the black background, matchbox 2 on the heterogenous background, car 1 with 25% black 
occlusion, car 2 with 50% heterogeneous occlusion 



Although we evaluate the density function only at discrete points of this transformation
space, we have to calculate the density value for 225 possible internal translations,
each with 36 possible internal rotations. For each of the altogether 8100 possible
transformations we have to interpolate about 80 feature vectors, as one can see in the
left image of Figure 2.

Since the interpolation is computationally expensive, we change this algorithm. We
interpolate the required area of the grid for each rotation only once and then translate
the object grid along the rotated coordinate axes in steps corresponding to the
resolution $r_s$. As visible in the right image of Figure 2, each interpolated feature vector
can be used for many transformations.



4 Experiments and Results 

For the experiments we used the five objects in Figure 3. The images were 256 pixels
square. For the training, 18 images of each object in different poses with the same
illumination were taken; the background was nearly homogeneous with a pixel intensity
of about 60. We took 17 further images of each object in other positions for the tests. For
the experiments with heterogeneous background we cut out the objects and pasted them
onto an absolutely black background as well as in front of a mouse pad (see Figure 3). For
the occlusions we blacked out 25% or 50% of the object in the test images,
and also covered the objects in the image with a part of the mouse pad. It has



Table 1. Error rates in percent for the object density only (no background model), the old background
model [6], and the new background model without and with normalization (see eq. (8)); for each model
3*170 localization and classification experiments are performed. Left: error rate for the localization;
middle: for the classification; right: average computation time for one localization on a Pentium
III 800 MHz.

                       localization                         classification
one illumination       het. back.  25% occl.  50% occl.     het. back.  25% occl.  50% occl.     time
only object dens.      22,9%       69,4%      82,4%         25,3%       62,4%      70,6%         0,8 s
background m. [6]      6,5%        24,1%      51,7%         20,1%       21,2%      50,0%         6,5 s
new background m.      0,0%        0,0%       7,1%          0,0%        0,0%       4,7%          1,3 s
the same with norm.    -           -          -             0,0%        0,0%       2,3%          1,3 s
to be mentioned that an absolutely black background or occlusion is a big difference from
the training, because the first component of the local feature vectors is the logarithmic
low-pass value of the wavelet analysis. So we got for each model 170 localization and
170 classification experiments for heterogeneous background, 25% occlusion and 50%
occlusion. For the background model two background classes were trained and used:
the absolutely black background and the mouse pad.

For the recognition we used a resolution of $r_s$ = 8 pixels for the rough localization
and a resolution of $r_s$ = 4 pixels for the refinement. The objects were searched in the
whole image for all internal rotations and translations. A localization is counted
as wrong if the rotation error is larger than 7° or the translation error is larger
than 7 pixels.

As one can see in Table 1, the simple object model for homogeneous background
(section 2) could handle heterogeneous background, but failed very often for occlusion.
The old background model [6] is better than the simple object model, but it is very
slow and for 50% occlusion the error rate was about 50%. The new background
model, in contrast, is much better and faster: for heterogeneous background and 25% occlusion there
were no errors, and even for 50% occlusion the error rate was small. Further, by the
normalization the classification error rate for 50% occlusion could be reduced from
4,7% to 2,3%. Additionally, the new model is five times faster than the old background
model.

For testing the robustness of this approach we also performed experiments with two 
different illuminations. We trained each object with 9 images taken with illumination 1 
and 9 images taken with illumination 2, where one of the three spotlights was switched 
off. Also for the test images we used these two illuminations. For the new background 
model the localization and classification error rate for heterogeneous background and
25% occlusion was still very small; only for 50% occlusion did it increase. But it could be
reduced to 4,8% by the normalization.



5 Conclusions and Outlook 

In this paper we presented a new efficient background model for object recognition. 
The background is modelled as uniform distribution and therefore independent from 



Table 2. Error rates in percent for two illuminations: for the object density only (no background model),
the old background model [6], and the new background model without and with normalization (see
eq. (8)); for each model 3*170 localization and classification experiments are performed. Left: error
rate for the localization; right: for the classification. The average computation time is nearly the
same as in Table 1.

                       localization                         classification
two illuminations      het. back.  25% occl.  50% occl.     het. back.  25% occl.  50% occl.
only object dens.      11,4%       60,0%      77,7%         26,4%       50,0%      70,0%
background m. [6]      6,8%        32,4%      54,7%         7,7%        23,9%      61,2%
new background m.      0,0%        3,0%       24,2%         0,0%        4,7%       18,9%
the same with norm.    -           -          -             0,0%        0,0%       4,8%
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the current used background, i. e. all possible backgrounds can be handled by only one 
background density function. We defined a assignment function that assigns each local 
feature vector either to the object or to the background. With this model we improved 
the recognition rate to nearly 100% for heterogeneous background and 25% occlusion 
and to nearly 95% for occlusion of 50%, even if two different lighting conditions are 
used. 

In the future we will extend the background model to 3-D objects. For this purpose 
we have to model the size of the bounded region, which has been fixed so far, as a 
function of the out-of-image-plane transformations. This is necessary because the 
appearance and the size of the objects vary with these external transformations. In 
addition, the assignment function is a good basis for multi-object recognition. 
Abstract. We consider the task of cell classification in fluorescent micrographs. 
We combine Independent Component Analysis (ICA) as a preprocessing step 
with a Self-organizing Map (SOM) operating on the resulting ICA feature space to classify 
image patches into cell and noncell images and to investigate the features of image 
patches in the vicinity of the classification border. We compare the classification 
performance of ICA bases of different sizes, generated by applying the infomax 
algorithm to image eigenspaces of different dimensionalities. We find an optimal 
performance for intermediate dimensionalities, characterized by the ICA basis 
patterns exhibiting salient features of an “idealized” cell shape, and we achieve 
classification results comparable to a previous approach based on PCA features. 

Keywords: independent component analysis, self-organizing maps, image 
classification, feature computation 



1 Introduction 



In recent years, automation of sample preparation and imaging has led to an increased 
number of digitized microscope images in microbiological research, referred to as 
micrographs. The evaluation of micrographs by human experts is an expensive, tiresome 
and error-prone task. Thus the development of efficient image-based classification 
algorithms for fully automatic micrograph evaluation or for decision support in cell 
classification is potentially extremely valuable to biological and medical researchers. 
However, the development of suitable algorithms is a delicate task [9], because the cell 
images are usually severely degraded by factors such as out-of-focus blur, non-uniform 
illumination and highly irregular object structures. 

In the present paper we consider the use of ICA features in conjunction with a Self- 
organizing Map (SOM) for the identification of cell bodies in micrographs of fluorescent 
cells. This approach is part of a larger effort towards the development of a highly scalable 
system for segmentation of fluorescence micrographs in high throughput [8]. 

Our current system uses PCA features as a first step, after which a neural network 
classifies image patches according to whether they are occupied by a fluorescent cell 
or not. Since the training of the system is based on data labelled by human experts, and 
since the decision whether an image feature is a cell body or not is difficult even for 
human experts, we want to assist the development of improved classifiers with an analysis 
of which image features may be particularly important for the expert’s decision. However, 
as is well known, PCA features tend to bear little resemblance to the objects from which 
they are derived. Therefore, we here investigate as an alternative approach the use of ICA 
features, which tend to be more localized and more amenable to human interpretation. To 
visualize the distribution of cells and noncells in feature space, we use a Self-organizing 
Map. Besides the usefulness of the combined approach for visualization, we also find that 
it achieves classification results which are comparable to the previous system. 



2 Independent Component Analysis of Images 



ICA theory [4] originated in the field of blind source separation of audio signals. Its 
use for dimension reduction is also motivated by neuron response properties found in 
the visual system of mammals [10]. The goal of ICA is to find a set of statistically 
independent components that span the space of the input images. 

Each input image of $D$ pixels is treated as a row vector of an $N \times D$ matrix 
$X \in \mathbb{R}^{N \times D}$ ($N$ is the number of given images). In the following, we 
always use centered data ($\langle X \rangle = 0$). The given images are assumed to be 
linear combinations of $K$ source images in $\mathbb{R}^{D}$, which are considered as the 
rows of a $K \times D$ matrix $S$: 



$X = AS$ (1) 

The matrix $A \in \mathbb{R}^{N \times K}$ is called the mixing matrix. One approach to 
find the source images is to compute a so-called decorrelation matrix $W$ such that 
$U = WX$ yields a $K \times D$ matrix whose rows are - up to scaling and permutation - 
equivalent to the $K$ source images. We set $S = U$ and thus $WA = I$. Figure 1 illustrates 
the described model by the example of image patches extracted from a micrograph. 

Various approaches for computing the decorrelation matrix $W$ exist. Most of them 
restrict $W$ to a square matrix. The algorithm used in this application is based on the 
infomax algorithm introduced by Sejnowski and Bartlett in [3] and [2]. The idea is to 
maximize the entropy of the distribution $P(y)$ of an auxiliary variable $y = g(WX)$, 
where $g$ is a sigmoid squashing function. When $W$ is such that $P(y)$ approaches a 
uniform density (which is characterized by a maximal entropy), it factorizes and so does 
the corresponding density for the linearly transformed variables $WX$. Gradient ascent 
on the entropy yields the learning rule, which can be put into a computationally more 
advantageous form by using the “natural gradient” [1]: 

$\Delta W \propto \left( \mathbb{1} + (1 - 2y)\, x_i^{T} \right) W$ (2) 

While applying (2) to the training images $x_i$ would be feasible, it has been 
observed in [2] and [5] that replacing the $x_i$ by a subset of the eigenvectors 
$e_j$, $j = 1, \dots, K$, of the covariance matrix $\langle X X^{T} \rangle$ leads to better 
learning performance. Geometrically, this means that the ICA components are sought in a 
subspace that captures the data variation except for a small part that corresponds to the 
neglected eigenvalues. At the same time, this offers a simple approach to compute only a 
limited number $K$ of independent components from a much larger number $N$ of image patches. 

Fig. 1. The ICA model applied to image patches of micrographs: Statistically independent 
source images $S$, transformed by an unknown mixing matrix $A$, form the observed images $X$. 
The decorrelation matrix $W$ with $WA = I$, applied to $X$, yields independent components 
$U$ with $U = S$. 
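
As a rough illustration of the procedure described in this section, the following sketch applies a batch version of the natural-gradient infomax rule to the first K principal components of a set of patches. It is a sketch under stated assumptions, not the implementation used for the experiments: the patches are synthetic random data, the names `leading_principal_components` and `infomax_ica` are hypothetical, the logistic function is assumed as the sigmoid, the rows are rescaled before learning, and the outer product in the update is taken with u = Wx, as in the common formulation of the natural-gradient rule.

```python
import numpy as np


def leading_principal_components(patches, K):
    """Return the K leading principal components (as rows) of the patches (N x D)."""
    centered = patches - patches.mean(axis=0)
    # Rows of vt are eigenvectors of the D x D covariance matrix, sorted by eigenvalue.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:K]


def infomax_ica(X, n_iter=500, lr=0.01):
    """Batch natural-gradient infomax; X is K x D and its columns are the samples."""
    K, D = X.shape
    W = np.eye(K)
    for _ in range(n_iter):
        U = W @ X                       # current component estimates, K x D
        Y = 1.0 / (1.0 + np.exp(-U))    # logistic squashing function (assumed)
        # Natural-gradient step (I + <(1 - 2y) u^T>) W, averaged over the D samples.
        W = W + lr * (np.eye(K) + (1.0 - 2.0 * Y) @ U.T / D) @ W
    return W


rng = np.random.default_rng(0)
patches = rng.laplace(size=(131, 15 * 15))   # synthetic stand-in for 131 cell patches
K = 20
X_K = leading_principal_components(patches, K)
X_K = X_K * np.sqrt(X_K.shape[1])            # rescale rows to unit mean square (assumption)
W_K = infomax_ica(X_K)
U = W_K @ X_K                                # rows: estimated independent components
print(W_K.shape, U.shape)
```

Each row of `U` can be reshaped to 15 x 15 pixels and inspected as a basis image; with real cell patches instead of random data, this is the kind of picture shown on the right of Fig. 2.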



3 ICA of Micrograph Image Patches 

We applied the ICA algorithm to a set of 15 x 15 pixel subimages of a 658 x 517 pixel 
micrograph showing muscle tissue with fluorescence-labeled lymphocyte cells. For the 
input matrix $X \in \mathbb{R}^{K \times D}$ of the infomax algorithm we chose the first $K$ 
principal components of 131 image patches classified (by the human expert) as containing 
a cell. 

Most of the computed independent components resemble fragments of an “idealized” 
cell image whose shape can be described as a centered circle with clearly distinguishable 
inner region and surround. For some components, the fragment is picked from one of the 
image corners, others show a “banana-like” part along the cell border or a fragment of 
the inner region fitting to the cell border (see fig. 2 third row on the right). Each fragment 
is clearly distinguishable from the background with properties making it suitable as an 
edge detector. 

For a small number of input images, for example the five principal components with 
the largest eigenvalues, the computed independent components resemble the eigencells. 
With an increasing number of ICA components, these break down into smaller pieces; 
ultimately, for K values approaching the number of image pixels, this leads to almost 
meaningless, pixel-sized components (see fig. 2 right) approximating the unit vectors 
along the pixel directions. 



4 Classification of Cell Images Using Self-Organizing Maps 

A Self-organizing Map (SOM) [6] is a widely used tool for visualizing and exploring 
n-dimensional data. It projects n-dimensional input data onto a two-dimensional grid of 
prototype vectors and thereby visualizes the distribution of the images in the feature 
space. Furthermore, a trained SOM can be used for classifying the input data. We take 
advantage of these properties to analyse the usability of ICA-computed features. 
Fig. 2. Based on an arbitrary number of image patches (first column) a varying number $K$ of 
principal components is computed (second column). These form the rows of the matrix $X_K$, which 
is used to compute the corresponding weight matrix $W_K$. The resulting independent components 
are shown on the right for various values of $K$. Small values of $K$ yield more “holistic” 
features; for large $K$ values the features become more and more fragmented down to an 
approximation of the pixel unit vectors (bottom row). 



One point of interest in analysing ICA for feature extraction is how to determine 
the number of independent components, and thereby the dimensionality of the feature 
space an image patch is projected to. For an experimental study, sets of 5, 10, 20 and 
50 independent components are computed from the described training set. The resulting 
filters are used to map the image patches to feature vectors: $f_K(x_i) = U x_i^{T}$, 
$i = 1, \dots, N$, and $U \in \mathbb{R}^{K \times D}$ leads to $f_K(x_i) \in \mathbb{R}^{K}$, 
$K \in \{5, 10, 20, 50\}$. Subsets of the resulting independent components are shown in 
figure 2. A SOM with varying parameters is applied to the features $f_K(x_i)$, and the 
resulting distribution of the image patches on the nodes is obtained as well as the 
classification result. 

For training and testing the SOM we use a set of about 660 image patches. 131 of 
them are labeled as being occupied by a cell (hereafter called cell), the remaining ones 
as not (called noncell). The former set is identical to the set of image patches which has 
been used to compute the ICA filters U. The image set is divided into three sets of about 
220 image patches each. Two sets are combined to form the training set and the third is 
used for testing. After training, the nodes are labeled as "cell-matching" and 
"noncell-matching" nodes by the prevailing label of the matching image patches. Figure 3 
shows two SOMs of different dimensions. For each node the label is shown together with one 
image patch arbitrarily chosen from the respective set of matching image patches. Comparing 
the labels of the image patches with the node labels, averaged by cross-validation over 
the exchanged sets, yields the misclassification rate. The following table shows the 
rates obtained for various feature dimensions and SOM architectures with exponentially 
decreasing learning rate and neighborhood radius. 
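
The training and evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the SOM implementation used for the experiments: the feature vectors are synthetic Gaussian data standing in for 20-dimensional ICA features, all function names are hypothetical, and the decay constants of the learning rate and neighborhood radius are arbitrary choices.

```python
import numpy as np


def train_som(data, grid=(10, 10), n_iter=3000, lr0=0.5, seed=0):
    """Train a SOM with exponentially decreasing learning rate and radius."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the prototype vectors with randomly chosen training samples.
    weights = data[rng.integers(0, len(data), size=rows * cols)].astype(float)
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        decay = np.exp(-3.0 * t / n_iter)
        x = data[rng.integers(0, len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))      # best-matching node
        grid_d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * (radius0 * decay) ** 2))  # neighbourhood function
        weights += (lr0 * decay) * h[:, None] * (x - weights)
    return weights


def matching_nodes(weights, data):
    """Index of the best-matching node for every row of data."""
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def label_nodes(weights, data, labels):
    """Label each node by the prevailing label of its matching patches (-1: no match)."""
    nodes = matching_nodes(weights, data)
    node_labels = np.full(len(weights), -1)
    for n in np.unique(nodes):
        node_labels[n] = int(labels[nodes == n].mean() >= 0.5)
    return node_labels


def misclassification_rate(weights, node_labels, data, labels):
    return float(np.mean(node_labels[matching_nodes(weights, data)] != labels))


# Synthetic stand-in for the ICA feature vectors: 1 = cell, 0 = noncell.
rng = np.random.default_rng(1)
features = np.vstack([rng.normal(0.0, 1.0, (132, 20)), rng.normal(1.5, 1.0, (528, 20))])
labels = np.array([1] * 132 + [0] * 528)
folds = np.array_split(rng.permutation(len(labels)), 3)

rates = []
for k in range(3):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(3) if j != k])
    som = train_som(features[train])
    node_labels = label_nodes(som, features[train], labels[train])
    rates.append(misclassification_rate(som, node_labels, features[test], labels[test]))
print("mean misclassification rate:", np.mean(rates))
```

In this sketch a test patch that matches a node without any training patches is counted as misclassified; other conventions are possible.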
Fig. 3. Two Self-organizing Maps (SOM) with image patches and node labels after training. The 
represented image patches are chosen arbitrarily from the respective set of matching image patches. 
□ denotes a cell-matching node and x a noncell-matching node. 



SOM         misclassification rate              rate of uncertain nodes 
dimension   (dimension of feature)              (dimension of feature) 
            5       10      20      50          5     10    20    50 
5x5         0.0909  0.0894  0.0606  0.0758      0.55  0.49  0.48  0.51 
10x10       0.0863  0.0712  0.0545  0.0697      0.19  0.17  0.15  0.17 
15x15       0.0894  0.0939  0.0878  0.0924      0.07  0.06  0.04  0.05 



Across the SOM architectures we observed that features based on 20 independent 
components are most useful for classification. A misclassification rate of 0.0545 is the 
best result we obtained, using a 10x10 SOM. 

A SOM allows one to inspect the label statistics and image patches associated with a 
selected node. Thereby a deeper understanding of the classification performance can be 
obtained. Although labeled differently, the image patches matching the same node look 
similar (see fig. 4 top right). Therefore, some of the SOM nodes are hard to label as 
"cell-matching" or "noncell-matching" nodes because after training cells match them as 
well as noncells. At first we do not consider the proportion of cells and noncells at the 
uncertain nodes but only the fact that they are uncertain. The fraction of uncertain 
nodes relative to the number of nodes in the SOM is listed in the table above. 
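
The rate of uncertain nodes can be computed directly from the node assignments. The following sketch uses a hypothetical function name and a toy assignment of six patches to four nodes; a node counts as uncertain as soon as at least one cell and at least one noncell match it.

```python
import numpy as np


def uncertain_node_fraction(node_of_patch, labels, n_nodes):
    """Fraction of all SOM nodes matched by both cells (1) and noncells (0)."""
    node_of_patch = np.asarray(node_of_patch)
    labels = np.asarray(labels)
    cell_nodes = set(node_of_patch[labels == 1].tolist())
    noncell_nodes = set(node_of_patch[labels == 0].tolist())
    return len(cell_nodes & noncell_nodes) / n_nodes


# Toy example: only node 0 is matched by both a cell and a noncell.
node_of_patch = [0, 0, 1, 2, 2, 2]
labels = [1, 0, 1, 0, 0, 0]
print(uncertain_node_fraction(node_of_patch, labels, n_nodes=4))  # 0.25
```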

Disregarding the misclassification rate, we observed that the dominating factor 
influencing the number of uncertain nodes is the size of the SOM. This can be explained 
by the varying ratio of the number of nodes on the border line between a cell-matching 
region and a noncell-matching region to the number of nodes in the regions themselves. 
However, comparing the rates of uncertain nodes for equal SOM dimensions favors 
the 20-dimensional feature vectors $f_{20}(x_i)$, because the labeling in those cases is 
most certain. 
Fig. 4. A 10 x 10 SOM with image patches matching selected nodes. □ denotes a cell which matches 
a node, x denotes a noncell. Image patches matching selected nodes are reproduced on the right. 



5 Discussion 

We explored a feature extraction approach based on ICA theory for the task of classifying 
cell micrograph image patches. We found ICA features that - unlike PCA features 
- can be interpreted in terms of geometric cell features when the number K of extracted 
ICA features falls in some intermediate range (well below the number of pixels in an 
image patch). Using a SOM for classification of image patches into cell and noncell 
images, we found a non-monotonic dependence of the classification performance on 
the number K of extracted ICA components, with an optimum classification performance 
correlated with the K-range for which the resemblance of ICA components to 
subparts of an “idealized” cell is most salient. This optimum K-range seems to mark the 
border between principal-component-like filters and independent components split into 
unspecific shreds. 

The achieved misclassification rate of 0.05 is comparable to that of a PCA-based 
cell classification system [7] and makes the current approach a competitive alternative 
to the PCA-based classifier. While the use of a SOM as a classifier is known not to be 
optimal [6], it provides its classification together with a visualization of the feature 
space. Thereby it also allows one to visually explore the classification border and to 
inspect the features characterizing image patches that are most easily confused by the 
system as well as by human experts. This, together with the observed better visual 
interpretability of ICA features as compared to PCA features, can help to improve the 
design of more specialized feature extraction schemes for cell micrograph classification 
or similarly difficult visual classification tasks characterized by a high amount of 
degradation. 



This work was partially funded by the BMBF under contract 01 IB 001B. 
Abstract. We present a novel approach for representing shape knowledge 
in terms of example views of 3D objects. Typically, such data sets 
exhibit a highly nonlinear structure with distinct clusters in the shape 
vector space, preventing the usual encoding by linear principal component 
analysis (PCA). For this reason, we propose a nonlinear Mercer-kernel 
PCA scheme which takes into account both the projection distance 
and the within-subspace distance in a high-dimensional feature 
space. The comparison of our approach with supervised mixture models 
indicates that the statistics of example views of distinct 3D objects can 
fairly well be learned and represented in a completely unsupervised way. 

Keywords: Nonlinear shape statistics, Mercer kernels, nonlinear density 
estimation, shape learning, variational methods, kernel PCA 



1 Introduction 

One of the central questions in computer vision is how to model the link between 
external visual input and internally represented, previously acquired knowledge. 
For the case of image segmentation, prior information on the shape of expected 
objects can drastically improve segmentation results [?]. A conceptually 
attractive way of incorporating prior information is given by a variational approach 
in which external image information and statistically acquired knowledge about 
the shape of expected objects are combined in a single cost functional 

$E = E_{\text{image}} + \alpha\, E_{\text{shape}}$ . (1) 

The present paper is concerned with the question of how to construct such a 
shape energy, which measures the similarity of a given shape to a set of training 
shapes. We focus on encoding views of distinct objects in an unsupervised way. 

In most of the models of shape variability it is assumed that the training 
shapes define some linear subspace of the shape space [?]. Though quite powerful 
in many applications, this assumption has only limited validity if the observed 
deformations are more complex. It fully breaks down once shapes of different 
classes are included in the training set, such as those corresponding to different 
objects or just different views of a single 3D object. An example is given in 
Figure 1, which shows a sampling along the first principal component for a set 
of 10 hand shapes containing right and left hands: the assumption of a linear 
distribution obviously results in an unwanted mixing of the two classes. 



Fig. 1. Mixing of two classes in a Gaussian model: Sampling along the first principal 
component around the mean (center) for a training set of 10 hands, comprising both 
left and right hands. Shapes of different classes are morphed in an undesirable way. 

Several approaches have been undertaken to model nonlinear shape variability. 
They often suffer from certain drawbacks, namely they assume some prior 
knowledge about the structure of the nonlinearity [?], or the number of underlying 
classes [?], or they involve an intricate model construction [?]. 

An elegant and promising way to avoid these drawbacks is to employ feature 
spaces induced by Mercer kernels [?], in order to indirectly model a nonlinear 
transformation $\phi(x)$ of the original data from a space $X$ into a potentially 
infinite-dimensional space $Y$, aiming at a simpler distribution of the mapped data 
in $Y$. The search for an appropriate nonlinearity is replaced by the search for 
an appropriate kernel function $k(x,y)$ defining the scalar product on $Y$: 

$k(x,y) = (\phi(x), \phi(y))$ . (2) 

With great success, this Mercer kernel approach has been used for the purpose 
of classification [?]. By contrast, our aim in the present paper is that of 
constructing a similarity measure by probability density estimation. We therefore 
propose to approximate the nonlinearly mapped data points $\phi(x)$ by a 
Gaussian probability density in the high-dimensional space $Y$. It turns out that 
this can be done in the framework of Mercer kernels, i.e. all nonlinearities can 
be expressed in terms of scalar products. 
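
To make the role of the scalar products concrete, the following sketch builds the matrix of all pairwise kernel values for a small point set and evaluates a squared distance in the mapped space without ever computing the mapping itself. It is an illustration only: the Gaussian radial basis function kernel in its unnormalised form and the function names are assumptions made for this example.

```python
import numpy as np


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel (unnormalised form, assumed here)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))


def gram_matrix(points, sigma=1.0):
    """Matrix of all pairwise scalar products (phi(x_i), phi(x_j)) = k(x_i, x_j)."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(points[i], points[j], sigma)
    return K


rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
K = gram_matrix(points)
# Squared distance between two mapped points, using scalar products only.
dist2 = K[0, 0] - 2.0 * K[0, 1] + K[1, 1]
eigenvalues = np.linalg.eigvalsh(K)
print(dist2, eigenvalues.min())
```

For a Mercer kernel the matrix `K` is symmetric and positive semi-definite, so its smallest eigenvalue is non-negative up to rounding errors.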

The resulting nonlinear density estimate in the original space X does not 
assume any prior information about the number of classes. Comparison with a 
supervised mixture model on simulated 2D data and its application to silhouettes 
of various 3D objects reveals that our estimate captures the essential nonlinear 
structure in the original (shape) space, although being fully unsupervised. 

Our method of density estimation is related to the so-called kernel PCA, 
which shall therefore be reviewed in the next section. 

2 Kernel Principal Component Analysis 

In [?] a method to perform nonlinear principal component analysis is proposed. 
This is done by assuming an appropriate nonlinear transformation $\phi(x_i)$ of the 
training data $\{x_i\}_{i=1,\dots,\ell}$ into a space $Y$ and performing a linear principal 
component analysis of the transformed data in $Y$ (after centering it in $Y$). It is shown 
that the nonlinearity enters the relevant expressions only in terms of scalar 
products (2). Therefore the choice of an appropriate nonlinear transformation $\phi$ 
corresponds to the choice of an appropriate kernel $k(x,y)$. The eigenvectors in 
$Y$ can be expressed as linear combinations of the mapped training data: 

$V_k = \sum_{i=1}^{\ell} \alpha_i^k\, \phi(x_i)$ , (3) 

with known coefficients $\alpha_i^k$. The projection of a mapped point $\phi(z)$ on the 
eigenvector $V_k$ is therefore given by: 

$\beta_k := (V_k, \phi(z)) = \sum_{i=1}^{\ell} \alpha_i^k\, k(x_i, z)$ . (4) 

In [12] this kernel PCA is applied to pattern reconstruction. To this end the 
authors propose to minimize the distance 

$\rho(z) = \| P_r \phi(z) - \phi(z) \|^2$ (5) 

of a mapped sample point to its projection onto the subspace spanned by the 
first $r$ eigenvectors: 

$P_r \phi(z) = \sum_{k=1}^{r} \beta_k V_k$ . (6) 

The distance (5) can be expressed in terms of the kernel function (2). For a 
suitable kernel, a corrupted pattern $z$ is reconstructed by minimizing (5). 



3 Density Estimation in Kernel Space 



In the present paper we deviate from the kernel PCA formulation above, namely 
we propose to perform a nonlinear probability density estimation by exploiting 
kernel spaces. We model the statistical distribution of the nonlinearly mapped 
data by a Gaussian distribution in $Y$. After centering, the covariance matrix in 
$Y$ is given by 

$\Sigma_\phi := \frac{1}{\ell} \sum_{i=1}^{\ell} \phi(x_i)\, \phi(x_i)^t$ . (7) 

Let $\lambda_1 \geq \dots \geq \lambda_r$ be the nonzero eigenvalues of $\Sigma_\phi$ and $V$ the 
matrix containing the respective eigenvectors $V_k$. In general $\Sigma_\phi$ is not invertible 
and needs to be appropriately regularized (cf. [7]), for example by replacing all zero 
eigenvalues by the smallest non-zero eigenvalue $\lambda_r$. The inverse of this matrix is: 

$\Sigma_\phi^{-1} = V\, \mathrm{diag}(\lambda_1^{-1}, \dots, \lambda_r^{-1})\, V^t + \lambda_r^{-1}\, (I - V V^t)$ . (8) 

Approximating the distribution of mapped points in $Y$ by a Gaussian density 

$\mathcal{P}(z) \propto \exp\left( -\tfrac{1}{2}\, \phi(z)^t\, \Sigma_\phi^{-1}\, \phi(z) \right)$ (9) 

corresponds (up to scaling) to an energy of the form: 

$E(z) = \phi(z)^t\, \Sigma_\phi^{-1}\, \phi(z)$ . (10) 

Using definition (8), the energy is split into two terms: 

$E(z) = \sum_{k=1}^{r} \lambda_k^{-1}\, (V_k, \phi(z))^2 + \lambda_r^{-1} \left( |\phi(z)|^2 - \sum_{k=1}^{r} (V_k, \phi(z))^2 \right)$ . (11) 

Inserting expansion (3) of the eigenvectors $V_k$ and the kernel (2) we get: 

$E(z) = \sum_{k=1}^{r} \left( \sum_{i=1}^{\ell} \alpha_i^k\, k(x_i, z) \right)^2 \cdot \left( \lambda_k^{-1} - \lambda_r^{-1} \right) + \lambda_r^{-1} \cdot k(z,z)$ . (12) 



Again, the nonlinearity only appears in terms of the kernel function. Starting 
from a shape vector $z$, minimization of (12) increases its similarity to the 
training data $\{x_i\}$. 
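
A minimal numerical sketch of such a kernel-space energy is given below. It is not the authors' implementation: the class name `KernelSpaceEnergy`, the toy data, the unnormalised Gaussian kernel and the choice r = 5 are assumptions made for illustration, and the centering in the mapped space is carried out explicitly on the kernel values.

```python
import numpy as np


def rbf_gram(A, B, sigma):
    """Gaussian RBF kernel values k(a_i, b_j) for all rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


class KernelSpaceEnergy:
    def __init__(self, X, sigma, r):
        self.X, self.sigma = X, sigma
        n_train = len(X)
        K = rbf_gram(X, X, sigma)
        self.row_mean = K.mean(axis=1)
        self.total_mean = K.mean()
        # Kernel matrix of the data after centering in the mapped space.
        Kc = K - self.row_mean[:, None] - self.row_mean[None, :] + self.total_mean
        mu, a = np.linalg.eigh(Kc)                 # eigenvalues in ascending order
        mu, a = mu[::-1][:r], a[:, ::-1][:, :r]    # keep the r largest
        self.lam = mu / n_train                    # eigenvalues of the covariance matrix
        self.alpha = a / np.sqrt(mu)               # coefficients giving unit-norm eigenvectors

    def __call__(self, z):
        kz = rbf_gram(self.X, np.atleast_2d(z), self.sigma)[:, 0]
        kz_c = kz - kz.mean() - self.row_mean + self.total_mean   # centered k(x_i, z)
        kzz_c = 1.0 - 2.0 * kz.mean() + self.total_mean           # centered k(z, z)
        beta = self.alpha.T @ kz_c                                # projections on the eigenvectors
        lam_r = self.lam[-1]
        return float(np.sum(beta ** 2 * (1.0 / self.lam - 1.0 / lam_r)) + kzz_c / lam_r)


# Toy data: two well-separated clusters of 2D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
energy = KernelSpaceEnergy(X, sigma=1.0, r=5)
print(energy([0.0, 0.0]), energy([2.5, 2.5]), energy([20.0, -20.0]))
```

In this toy setting the energy is low at a cluster center and much higher both between the clusters and far away from the data, which is the behaviour expected of a similarity measure for clustered training sets.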

How and why does energy (12) differ from distance (5) proposed in [12]? The 
second term in (11), weighted by $\lambda_r^{-1}$, is identical with (5). It corresponds to the 
distance of a mapped point $\phi(z)$ to the feature space $F$, which is the subspace 
of $Y$ spanned by the mapped training data. Following an analogous derivation 
in the linear setting [?], we call this term distance from feature space (DFFS). 
The first term in (11) is called distance in feature space (DIFS). Both of these 
distances are visualized in Figure 2: the original data is mapped from the space 
$\mathbb{R}^n$ to a (generally higher dimensional) space $Y$ by the nonlinear mapping $\phi$. 
The space $Y$ is the direct sum of $F$ and its orthogonal complement $\bar{F}$ in $Y$. 



Fig. 2. Nonlinear mapping into $Y = F \oplus \bar{F}$ and the distances DIFS and DFFS. 



In order to measure how similar a point $z$ is to the training data $\{x_i\}$, both 
distances - DIFS and DFFS - need to be included. The DFFS by itself is not 
sufficient: it completely ignores how the mapped training data is distributed in 
$F$. Moreover, one can easily imagine the mapped test point $\phi(z)$ to be far away 
from the mapped training data, while still being at exactly the same DFFS. 

Including the DIFS as proposed in (12) accounts for the distance of the 
projection $P_r \phi(z)$ within $F$ from the mapped training data. It is 
the Mahalanobis distance in the feature subspace $F$. Therefore, (12) is a more 
reliable measure of the similarity of a test point $z$ to the training data $\{x_i\}$. 

4 Numerical Results 

4.1 Unsupervised Density Estimation via Kernel Spaces versus 
Supervised Mixture Models 

Given the information which class each training point belongs to, one can construct 
a mixture model of Gaussian distributions as a nonlinear extension of 
PCA. For each class $i$ one calculates the mean $m_i$ and the covariance matrix $\Sigma_i$. The 
total probability is the sum of the probabilities for each class. The corresponding 
energy is given by: 

$E(z) = -\frac{1}{\beta} \log \left( \sum_i \frac{1}{c_i} \exp\left( -\beta E_i(z) \right) \right)$ , where $c_i := |2\pi \Sigma_i|^{1/2}$ , (13) 

and 

$E_i(z) = \frac{1}{2} (z - m_i)^t\, \Sigma_i^{-1}\, (z - m_i)$ . (14) 

The additional parameter $\beta$ is introduced to allow smoothing. For small values 
of $\beta$ one obtains the weighted sum of the single class energies (14): 

$E(z) \approx \sum_i a_i E_i(z) + \text{const}$ for $\beta \ll 1$ , with $a_i := \frac{1/c_i}{\sum_j 1/c_j}$ . (15) 

The limit $\beta \to \infty$ gives their minimum: $\lim_{\beta \to \infty} E(z) = \min_i E_i(z) + \text{const}$. 

We compared our approach (12) for a Gaussian radial basis function kernel [?] 

$k(x,y) \propto \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$ (16) 

to the supervised case (13) on an artificial training set of 2D points, which were 
sampled from three different Gaussian distributions. The training data and the 
level lines of the respective energies are depicted in Figure 3. 

The comparison shows several advantages of our method. The kernel space 
approach is unsupervised: The class membership of a training point is neither 
known, nor determined beforehand. Even the knowledge that the data of each 
class is sampled from a Gaussian distribution is not taken into account. Yet, the 
qualitative comparison shows that the data distribution is approximated better 
than by the mixture model, which is based on the valid assumption of Gaussian 
distributions and which does imply the knowledge about the class membership 
of each point. Accordingly, the density estimate obtained by the mixture model 
is always restricted to ellipse-like level lines. 
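
The behaviour of the supervised mixture energy for small and large values of the smoothing parameter can be checked numerically. The sketch below is only an illustration under assumptions: the function names and the two toy classes are hypothetical, and the class constants are taken to be the usual Gaussian normalization, i.e. the square root of the determinant of 2π times the class covariance.

```python
import numpy as np


def class_energy(z, mean, cov):
    """Single class energy: half the squared Mahalanobis distance to the class mean."""
    d = np.asarray(z, dtype=float) - mean
    return 0.5 * d @ np.linalg.solve(cov, d)


def mixture_energy(z, means, covs, beta):
    """E(z) = -(1/beta) log sum_i (1/c_i) exp(-beta E_i(z)), c_i assumed as stated above."""
    E = np.array([class_energy(z, m, S) for m, S in zip(means, covs)])
    log_c = np.array([0.5 * np.linalg.slogdet(2.0 * np.pi * S)[1] for S in covs])
    a = -beta * E - log_c
    # Log-sum-exp for numerical stability.
    return -(a.max() + np.log(np.exp(a - a.max()).sum())) / beta


# Toy example with two classes in 2D.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
z = [1.0, 1.0]
E_i = [class_energy(z, m, S) for m, S in zip(means, covs)]
print(mixture_energy(z, means, covs, beta=1e4), min(E_i))
```

For a large smoothing parameter the mixture energy approaches the smallest single class energy, while for a small one it approaches, up to a constant, a weighted sum of the class energies.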
Fig. 3. Level lines of the energies corresponding to a supervised mixture model (13) for 
$\beta = 1$ (left) and $\beta = 0.02$ (center) and the unsupervised density estimate via kernel 
spaces (12) for $\sigma = 1.5$ (right). These figures illustrate that our approach captures 
nonlinear data distributions without the need to classify the training data beforehand. 



4.2 Nonlinear Shape Statistics in Kernel Space 

In order to apply our distance measure (12) to realistic shapes, we parameterized 
the silhouettes of binarized training objects by closed spline curves. The 
spline curves were aligned with respect to Euclidean transformations and cyclic 
renumbering of the control points - see Figure 4. We used 100 control points 
in order to assure a sufficiently detailed contour description. The control point 
vectors were then used as training data to construct the energy (12), again using 
the kernel (16). In order to visualize the energy we projected the control 
point vectors of the training contours onto the first two principal components 
of a linear PCA¹. The data points and the respective level lines of energy (12) 
are shown in Figure 5. The projection shows that our density estimate works 
well even in higher dimensions² and for distributions which are not necessarily 
Gaussian - see Figure 5. Compared to linear PCA (elliptical level lines) the true 
data distribution is approximated much better. This is crucial since the different 
shapes can be quite similar - see Figure 5, right side. Moreover, the construction 
of the shape energy is fully unsupervised, i.e. it involves neither the number of 
objects nor the number of clusters into which the different views of one object can 
be separated. By changing the parameter $\sigma$ in (16), one can choose how detailed 
the approximation of the data should be - see Figure 5, middle and right. 

Note that we are not interested in classification of the objects; we merely 
want a measure of how similar an object is to a set of training objects given 
their 2D projections. It is therefore irrelevant whether all projections of one 3D 
object can be associated with one cluster. Rather we expect to obtain several 
clusters corresponding to the stable views of each object. 

Fig. 4. 3D sample objects, and aligned silhouettes for several views of these objects. 
Applying linear PCA to the training set on the right would not produce an accurate 
description of the shape variability. 

Fig. 5. Training shapes and level lines corresponding to the shape density estimate in 
kernel space (12), projected onto the first two principal components of a linear PCA. 
Left: Different views of objects 1 (o), 2 (+) and 3 (•) in Figure 4 for $\sigma = 0.04$. 
Center: Left hands (+) and right hands (•) (used in Figure 1) for $\sigma = 0.1$. Right: 
Hands for $\sigma = 0.04$. Clusters in high-dimensional shape space are estimated in variable 
detail. 

¹ Note that linear PCA is only used as a coordinate frame for visualization of the 
high-dimensional data! 

² Due to the 2D projection, Figure 5 is merely a crude visualization of how the data 
distribution is approximated in the original 200-dimensional space. 



5 Conclusion 

We presented a method to perform nonlinear density estimation in the framework 
of kernel spaces. A set of training points is mapped to a higher dimensional 
space $Y$ by a nonlinear mapping $\phi$. The distribution of mapped points is then 
approximated by a Gaussian distribution in $Y$. Back projection to the original 
space allows a visualization of the estimated density. Comparison to supervised 
mixture models shows the advantages of our approach - namely that it is fully 
unsupervised and that the data distribution is approximated more appropriately. 
An application of this density estimation to silhouettes of 3D objects shows 
that the density estimate via kernel spaces seems to be well suited for 
high-dimensional and highly nonlinear data distributions. We argued that the distance 
measure corresponding to the density estimation in kernel spaces is more reliable 
than that obtained in kernel PCA [12]. 



276 D. Cremers, T. Kohlberger, and C. Schnorr 



Ongoing work focuses on ways to automatically estimate the optimal size of 
the parameter $\sigma$ and on the application of the proposed density estimation to 
image segmentation [?]. 
Abstract. This article deals with various aspects of normalization in 
the context of Support Vector Machines. We first consider normalization 
of the vectors in the input space and point out the inherent limitations. 
A natural extension to the feature space is then represented by the 
kernel function normalization. A correction of the position of the 
Optimal Separating Hyperplane is subsequently introduced so as to 
better suit these normalized kernels. Finally, numerical experiments evaluate 
the different approaches. 

Keywords. Support Vector Machines, input space, feature space, nor- 
malization, optimal separating hyperplane 



1 Introduction 

Support Vector Machines (SVMs) have drawn much attention because of their 
high classification performance [1]. In this article, they are applied to a computer 
vision problem, namely the classification of images. SVMs are often combined 
with a preprocessing stage to form pattern recognition systems. Moreover, it 
turns out to be intrinsically necessary for the SV algorithm (see [1]-[4]) to have 
preprocessed data. It has been shown [5] that normalization is a type of 
preprocessing which plays an important role in this context. Theoretical 
considerations on the kernel interpretation of normalization and an adaptation of 
the SV algorithm to normalized kernel functions will be developed in this paper 
in order to shed new light on such pattern recognition systems. 

In this study, we deal with normalization aspects in SVMs. First, normalization
in the input space is considered in Sec. 2 and a resulting problem related
to SV classification is outlined. A possible solution, namely the normalization
of the kernel functions in the feature space, is then presented. A
modification of the SV algorithm follows in Sec. 3. This modification
takes into account the properties of the normalized kernels and amounts to
a correction of the position of the Optimal Separating Hyperplane (OSH). The
corresponding numerical experiments are reported in Sec. 4, and Sec. 5 concludes
the paper.
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2 Normalization in the Input and Feature Space 



The normalization of the vectors of the input space can be considered as the
most basic type of preprocessing. Assume $x \in \mathbb{R}^N$ is an input vector; the
corresponding normalized vector $\tilde{x}$ may be expressed as:

$$\tilde{x} = \frac{x}{\|x\|} \qquad (1)$$

where $\|x\|^2 = \sum_{i=1}^{N} x_i^2$. This vector lies on a unit hypersphere of $\mathbb{R}^N$. Thus,
if the input vectors are images, normalization in the input space amounts to 
rescaling the intensity of the pixels of the images since such a preprocessing 
only changes the norm of the image vector. The SV algorithm is constructed 
to find the OSH in the feature space, the latter being obtained by a non-linear 
mapping from the normalized input space. When considering the effect of such 
a mapping on the normalized input vectors, it appears, in most cases, to cause a 
loss of normalization or a scale problem, as shown in Figure 1. This may create a
problem for the SV algorithm since the latter mainly needs “input” vectors in the
feature space which are in some way “scaled”. As suggested in [5], normalization
in the feature space presents a solution to this problem.

Fig. 1. Normalization in the input space and its representation in the feature space
(panels: input space, normalized input space, feature space)

Normalization in the feature space is not strictly-speaking a form of prepro- 
cessing since it is not applied directly on the input vectors but can be seen as a 
kernel interpretation of the preprocessing considered above, i.e., an extension to 
the feature space of the normalization of the input vectors. Normalization in the 
feature space essentially amounts to redefining the kernel functions of the SVM 
as it is applied to the unprocessed input vectors. Moreover, since the non-linear 
mapping is not known, this normalization only makes use of the kernel functions. 
Assume $K(x,y)$ is the kernel function representing a dot product in the feature
space. Normalization in the feature space then amounts to defining a new kernel
function $\tilde{K}(x,y)$ as follows:

$$\tilde{K}(x,y) = \frac{K(x,y)}{\sqrt{K(x,x)\,K(y,y)}} \qquad (2)$$

We clearly have $\tilde{K}(x,x) = 1$. Notice that this is always true for RBF kernels
$K(x,y) = \exp\left(-\|x-y\|^2 / (2\sigma^2)\right)$. Thus, all vectors in the feature space lie on a unit
hypersphere. For monomial kernels $K(x,y) = (x \cdot y)^p$, normalization in the
input space is equivalent to normalization in the feature space. Indeed, we have:

$$K(\tilde{x},\tilde{y}) = (\tilde{x} \cdot \tilde{y})^p = \frac{(x \cdot y)^p}{\|x\|^p \, \|y\|^p} = \tilde{K}(x,y).$$

When the kernel function is replaced
by the dot product in the input space ($p = 1$), equation (2) reduces to equation
(1). Moreover, note that $\tilde{K}(x,y) = \tilde{\varphi}(x) \cdot \tilde{\varphi}(y)$ where $\tilde{\varphi}(x) = \varphi(x) / \|\varphi(x)\|$
stands for the “normalized” mapping and thus the expression above satisfies the
conditions of Mercer’s theorem.
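
As a minimal sketch of the kernel normalization of equation (2), the following Python fragment wraps an arbitrary kernel function. The helper names (`normalize_kernel`, `monomial_kernel`, `polynomial_kernel`, `normalize_input`) are illustrative assumptions and do not come from the paper; vectors are represented as plain Python lists.

```python
import math

def dot(x, y):
    """Dot product of two equally long lists of numbers."""
    return sum(a * b for a, b in zip(x, y))

def normalize_input(x):
    """Input-space normalization: x / ||x||."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def monomial_kernel(p):
    """K(x, y) = (x . y)^p."""
    return lambda x, y: dot(x, y) ** p

def polynomial_kernel(p):
    """K(x, y) = (1 + x . y)^p."""
    return lambda x, y: (1.0 + dot(x, y)) ** p

def normalize_kernel(kernel):
    """Feature-space normalization: K(x, y) / sqrt(K(x, x) * K(y, y))."""
    def normalized(x, y):
        return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))
    return normalized
```

With these helpers, `normalize_kernel(polynomial_kernel(3))(x, x)` evaluates to 1 for any nonzero `x`, and for a monomial kernel the input-space and feature-space normalizations can be compared numerically on the same pair of vectors.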

When considering single-class SVMs as introduced in [6], the normalized 
kernel functions play a predominant role. We consider data rescaled
such that it lies in the positive orthant, i.e. in $[0,\infty)^N$. The normalized kernels
then place the datapoints on a portion of the unit hypersphere in the feature 
space allowing them to be separated from the origin by a hyperplane. 



3 Adaptation of the SV Algorithm in a Normalized 
Feature Space 

Normalization in the feature space changes the kernel functions, and thus also the 
SV optimization problem. We shall here study the implications of a normalized 
feature space, i.e. of normalized kernel functions, on the SV algorithm. The 
latter determines the OSH defined by its normal vector w and its position b. 
By construction, the margins of separation are symmetric around the OSH since
both lie at a distance $\delta = 1/\|w\|$ from the OSH. However, all the datapoints lie
on a unit hypersphere in the normalized feature space. It would thus be more
accurate to do classification not according to an OSH computed such that the
margins are symmetric around it, but according to an OSH determined such
that the margins define equal distances on the hypersphere. This may be done
by adjusting the value of $b$. The margins, and thus $w$, are unchanged since
the problem is symmetric around $w$. In other words, the separating hyperplane
is translated. In order to compute the correction to the value of $b$, consider
Figure 2. The intersections of the two margin hyperplanes with the unit hypersphere
are represented by the angles $\alpha_1$ and $\alpha_2$ defined by $\cos(\alpha_1) = b - \delta$ and
$\cos(\alpha_2) = b + \delta$. The bisection of the angle formed by $\alpha_1$ and $\alpha_2$ is represented
by the angle $\varphi$ and can be computed as $\varphi = (\alpha_1 + \alpha_2)/2$.
Moreover, we set $\cos(\varphi) = b'$ where $b'$ stands for the new position of the OSH.
Finally we get the following expression:

$$b'(b, w) = \cos\left( \frac{\arccos\left(b - \frac{1}{\|w\|}\right) + \arccos\left(b + \frac{1}{\|w\|}\right)}{2} \right) \qquad (3)$$

Fig. 2. Normalization in a two-dimensional feature space and computation of $b'$

From the above equation, we see that $b \neq b'$. The correction of $b$ mentioned here
leads to a new optimal separating hyperplane defined by $(w, b')$. This method
is valid regardless of the kernel function as long as it is normalized according
to equation (2). The value of $b'(b,w)$ can be computed from equation (3) once
the optimal parameters of the separating hyperplane $(w, b)$ are found by the
SV algorithm. The correction is not applied while the optimization process is
running since this would create convergence problems.
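
The correction of the hyperplane position is a post-processing step on the trained parameters. The sketch below assumes that `b` and the margin half-width `1 / w_norm` are expressed in the same units, so that `b - delta` and `b + delta` are valid cosines; the function name and the explicit range check are assumptions, not part of the paper.

```python
import math

def corrected_offset(b, w_norm):
    """Return cos((arccos(b - delta) + arccos(b + delta)) / 2) with delta = 1 / ||w||."""
    delta = 1.0 / w_norm
    lower, upper = b - delta, b + delta
    # Both margin hyperplanes must intersect the unit hypersphere.
    if lower < -1.0 or upper > 1.0:
        raise ValueError("margin hyperplanes do not intersect the unit hypersphere")
    alpha_1 = math.acos(lower)   # angle of the margin at b - delta
    alpha_2 = math.acos(upper)   # angle of the margin at b + delta
    return math.cos(0.5 * (alpha_1 + alpha_2))
```

The function only needs the offset and the norm of the weight vector, so it can be applied once after the optimization has converged, in line with the remark that the correction is not applied during training.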



4 Numerical Experiments 



Here, we construct a pattern recognition system formed by an SVM with either
input space normalization, feature space normalization, or the latter combined
with the correction of the position of the OSH.

Database. The Columbia Object Image Library (COIL-100), which can be downloaded
from http://www.cs.columbia.edu/CAVE/, was chosen as in [7], but with
different training and testing protocols (see below). The library is composed
of 100 different objects, each one being represented by 72 color images (one
perspective of the object every 5 degrees) of size 128x128 pixels. These images
were first converted to greyscale and reduced to 32x32 pixels
by averaging over square pixel patches of size 4x4 pixels. The database was
separated into a training and a testing dataset. For each object the perspectives
0-30-60-...-330 went into the training set and the perspectives 15-45-75-...-345
into the testing set. In other words, both datasets are composed of regularly
spaced, non-overlapping perspectives of each object. We are thus confronted with
a multi-class classification problem and the choices described below are made
for the training and for the classification protocols.
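
The size reduction and the split into perspectives can be sketched as follows. Images are assumed to be 2D lists of grey values whose side lengths are multiples of the patch size; the function names are hypothetical and the code is only an illustration of the protocol, not the authors' implementation.

```python
def downsample_by_averaging(image, block=4):
    """Replace each non-overlapping block x block patch by its mean value."""
    rows, cols = len(image), len(image[0])
    reduced = []
    for r in range(0, rows, block):
        reduced_row = []
        for c in range(0, cols, block):
            patch = [image[i][j]
                     for i in range(r, r + block)
                     for j in range(c, c + block)]
            reduced_row.append(sum(patch) / len(patch))
        reduced.append(reduced_row)
    return reduced

def split_perspectives(step=5, n_views=72):
    """Return the training angles (0, 30, ..., 330) and test angles (15, 45, ..., 345)."""
    angles = [i * step for i in range(n_views)]
    train = [a for a in angles if a % 30 == 0]
    test = [a for a in angles if a % 30 == 15]
    return train, test
```

Applied to a 128x128 image with the default patch size, `downsample_by_averaging` yields a 32x32 image, i.e. an input vector of size 1024.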






Training. For each object $i = 1, \dots, 100$, a classifier $C_i$ is generated by assigning
a target +1 to the training images of the object $i$ and a target -1 to the training
images of all the remaining objects. We thus choose a “one against all” strategy.
The regularisation parameter is set a priori to 1000.

Classification. Each testing image $z$ is presented to each of the classifiers and
is assigned to class $i$ where $C_i(z) > C_k(z)\ \forall k \neq i$ according to a “winner take
all” strategy. As shown in [8], this approach is computationally very efficient
for feedforward neural networks with binary gates such as those encountered in
SVMs. Since the class of the testing images is known, an error is computed for
each test image. The classification error for the experiment is then the average
of all the individual errors for each $C_i$ and the corresponding variance over the
objects is also computed.
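
A hedged sketch of the “winner take all” decision and of the error statistics: each classifier is assumed to be a callable returning a real-valued score, and classes are indexed from 0 rather than from 1. The function names are illustrative. Ties are resolved in favour of the lowest index and the variance is the population variance; neither choice is specified in the text.

```python
def winner_take_all(classifiers, z):
    """Return the index of the classifier with the largest output for z."""
    scores = [classifier(z) for classifier in classifiers]
    return max(range(len(scores)), key=scores.__getitem__)

def per_class_errors(classifiers, samples):
    """Error rate per true class for a list of (z, label) test samples."""
    totals, wrong = {}, {}
    for z, label in samples:
        totals[label] = totals.get(label, 0) + 1
        if winner_take_all(classifiers, z) != label:
            wrong[label] = wrong.get(label, 0) + 1
    return {label: wrong.get(label, 0) / n for label, n in totals.items()}

def mean_and_variance(values):
    """Mean and population variance of a collection of numbers."""
    values = list(values)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance
```

The overall figure reported for an experiment would then be `mean_and_variance(per_class_errors(classifiers, samples).values())`.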

Polynomial kernel functions $K(x, y) = (1 + x \cdot y)^p$ are particularly well suited for
the considered studies. The results are presented in Figure 3.

Fig. 3. Classification error and variance for polynomial kernels with normalization in the
input space, in the feature space or in the feature space with correction of $b$

When considering the results, we notice that the three error curves are monotonically decreasing
and none are crossing each other. This seems to reflect the good generalization 
ability of the considered algorithms. The experiments clearly point out that the 
normalization in the feature space outperforms the normalization in the input 
space for all the considered degrees except for p = 1. Indeed, for linear kernels, 
input and feature space normalizations are of the same order of magnitude for 
large values of the input vectors. This is the case here since the latter represent 
vectors of size 1024 with values ranging between 0 and 255. Thus, for higher
values of $p$ (i.e., more sensitive kernel functions), the difference between these
two normalizations gets more pronounced, yielding a bigger difference in the
classification performance. Furthermore, the correction of the position of the 
OSH for normalized kernels decreases the classification error further (albeit not 
significantly) for the four most sensitive kernels. Again, the linear case is left 
unchanged since both normalizations are then almost identical. The correction
brought to the value of $b$ by the previously introduced method is measured
to be approximately 5%.

5 Conclusions 

When considering a classification machine in computer vision, the preprocessing 
stage is of crucial importance. In particular when dealing with SVMs, this stage 
can influence dramatically the results of the classification. In this article, we 
mentioned that considering the preprocessing stage can be equivalent to study- 
ing the kernel functions of SVMs. In this perspective, we discussed one of the 
most basic types of preprocessing, namely normalization. Normalization was first 
considered in the input space and it was noticed that this preprocessing was not 
appropriate when considering SVMs. A natural extension is to move into the 
feature space by considering normalization of the kernel functions. Since classi- 
fication is performed in this space, a correction of the position of the OSH was 
introduced to better suit the normalized kernels. This novel algorithm is shown 
to have the same optimal solutions for w as the standard SV algorithm, but the 
considered correction is introduced in the final computation of the position of 
the OSH. Numerical experiments corroborated that normalization in the feature 
space outperformed the one in the input space and that the correction of the SV 
algorithm was revealed to be most effective. 
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Abstract. Support vector machines (SVM) are learning algorithms de- 
rived from statistical learning theory. The SVM approach was originally 
developed for binary classification problems. In this paper SVM architec- 
tures for multi-class classification problems are discussed, in particular we 
consider binary trees of SVMs to solve the multi-class problem. Numeri- 
cal results for different classifiers on a benchmark data set of handwritten 
digits are presented. 

1 Introduction 

Statistical learning theory developed by V. Vapnik formalizes the task of learning 
from examples and describes it as a problem of statistics with finite sample 
size Q. Originally, the SVM approach was developed for two-class or binary 
classification. The $N$-class classification problem is defined as follows: Given a
set of $M$ training examples $(x^\mu, y^\mu)$, $\mu = 1, \dots, M$, with input vector $x^\mu \in \mathbb{R}^d$ and with
$y^\mu \in \{1, \dots, N\}$ as the class label of input $x^\mu$, find a decision function
$F : \mathbb{R}^d \to \{1, \dots, N\}$ mapping an input $x$ to a class label $y$. Multi-class classification
problems (where the number of classes $N$ is larger than 2) are often solved using
voting schemes based on the combination of binary decision functions. One
approach is constructing $N$ binary classifiers (e.g. an SVM network), one for
each class, together with a maximum detection across the classifier outputs to
classify an input vector $x$. This one-against-rest strategy is widely used in the
pattern recognition literature. Another classification scheme is the one-against-one
strategy, where $\binom{N}{2}$ binary classifiers are constructed, separating each pair
of classes, together with a majority voting scheme to classify the input vectors.
A different approach to solve an $N$-class pattern recognition problem is to build
a hierarchy or tree of binary classifiers. Each node of the graph is a classifier
performing a predefined classification subtask. In this procedure the hierarchy
of subtasks has to be determined before the classifiers are trained.

2 Support Vector Learning 

In this section we briefly review the basic ideas of support vector learning and
present four multi-class classification techniques which may be applied to SVMs.

SVMs were initially developed to classify data points of linearly separable data sets
m- In this case a training set consisting of $M$ examples $(x^\mu, y^\mu)$, $x^\mu \in \mathbb{R}^d$ and
$y^\mu \in \{-1, 1\}$, can be divided up into two sets by a separating hyperplane. Such
a hyperplane is determined by a weight vector $b \in \mathbb{R}^d$ and a bias or threshold
$\theta \in \mathbb{R}$ satisfying the separating constraints

$$y^\mu \left( \langle x^\mu, b \rangle + \theta \right) \geq 1, \qquad \mu = 1, \dots, M.$$




(a) Optimal separating hyperplane with a large margin. (b) Separating hyperplane with a smaller margin.

Fig. 1. Binary classification problem. The examples of the two different classes are
linearly separable.



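The distance between the separating hyperplane and the closest data points
of the training set is called the margin. The separating hyperplane with maximal
margin is unique and can be expressed by a linear combination of those training
examples (so-called support vectors) lying exactly at the margin. It has the form

$$H(x) = \sum_{\mu=1}^{M} \alpha_\mu^* y^\mu \langle x, x^\mu \rangle + \alpha_0^*.$$

Here $\alpha_1^*, \dots, \alpha_M^*$ is the solution optimizing the functional

$$Q(\alpha) = \sum_{\mu=1}^{M} \alpha_\mu - \frac{1}{2} \sum_{\mu,\nu=1}^{M} \alpha_\mu \alpha_\nu y^\mu y^\nu \langle x^\mu, x^\nu \rangle$$

subject to the constraints $\alpha_\mu \geq 0$ for all $\mu = 1, \dots, M$ and $\sum_{\mu=1}^{M} \alpha_\mu y^\mu = 0$.

Then a training vector $x^\mu$ is a support vector if the corresponding coefficient
$\alpha_\mu^* > 0$. Then it is $b = \sum_{\mu=1}^{M} \alpha_\mu^* y^\mu x^\mu$ and the bias $\alpha_0^*$ is determined by a single
support vector $(x^s, y^s)$: $\alpha_0^* = y^s - \langle x^s, b \rangle$.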
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The SVM approach can be extended to the nonseparable situation and to 
the regression problem. In most applications (regression or pattern recognition 
problems) linear solutions are insufficient, so it is common to define an appropriate
set of nonlinear mappings $g := (g_1, g_2, \dots)$, transforming the input vectors
$x^\mu$ into a vector $g(x^\mu)$ which is an element of a new feature space $\mathcal{H}$. Then the
separating hyperplane can be constructed in the feature space $\mathcal{H}$. Provided $\mathcal{H}$ is
a Hilbert space, the explicit mapping $g(x)$ does not need to be known since it
can be implicitly defined by a kernel function $K(x, x^\mu) = \langle g(x), g(x^\mu) \rangle$ representing
the inner product of the feature space. Using a kernel function $K$ satisfying the
condition of Mercer’s theorem (see |E]), the separating hyperplane is given by 

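$$H(x) = \sum_{\mu=1}^{M} \alpha_\mu^* y^\mu K(x, x^\mu) + \alpha_0^*.$$

The coefficients $\alpha_\mu^*$ can be found by solving the optimization problem

$$Q(\alpha) = \sum_{\mu=1}^{M} \alpha_\mu - \frac{1}{2} \sum_{\mu,\nu=1}^{M} \alpha_\mu \alpha_\nu y^\mu y^\nu K(x^\mu, x^\nu)$$

subject to the constraints $0 \leq \alpha_\mu \leq C$ for all $\mu = 1, \dots, M$ and $\sum_{\mu=1}^{M} \alpha_\mu y^\mu = 0$,
where $C$ is a predefined positive number. An important kernel function satisfying
Mercer's condition is the Gaussian kernel function (also used in this paper)

$$K(x, y) = e^{-\|x - y\|_2^2 / (2\sigma^2)}.$$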

3 Multi-class Classification 

In many real world applications, e.g. speech recognition, or optical character 
recognition, a multi-class pattern recognition problem has to be solved. The 
SVM classifier is a binary classifier. Various approaches have been developed in 
order to deal with multi-class classification problems. The following strategies 
can be applied to build $N$-class classifiers utilizing binary SVM classifiers.

One-against-rest classifiers. In this method $N$ different classifiers are constructed,
one classifier for each class. Here the $l$-th classifier is trained on the
whole training data set in order to classify the members of class $l$ against the
rest. For this, the training examples have to be re-labeled: Members of the $l$-th
class are labeled to 1; members of the other classes to $-1$. In the classification
phase the classifier with the maximal output defines the estimated class label of
the current input vector.

One-against-one classifiers. For each possible pair of classes a binary classifier
is calculated. Each classifier is trained on a subset of the training set containing
only training examples of the two involved classes. As for the one-against-rest
strategy the training sets have to be re-labeled. All $N(N-1)/2$ classifiers
are combined through a majority voting scheme to estimate the final classification.
Here the class with the maximal number of votes among all $N(N-1)/2$
classifiers is the estimation.
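
The majority voting of the one-against-one strategy can be sketched as follows. The pairwise classifiers are assumed to be callables stored under the key `(i, j)` with `i < j`, returning a positive value for class `i` and a non-positive value for class `j`. The toy classifiers built by `make_pair_classifiers` (nearest of two 1D centers) stand in for trained SVMs and are purely hypothetical, as is the tie-breaking rule.

```python
from collections import Counter
from itertools import combinations

def make_pair_classifiers(centers):
    """Toy 1D pairwise classifiers: positive output means class i, otherwise class j."""
    classifiers = {}
    for i, j in combinations(range(len(centers)), 2):
        # Default arguments bind the current pair of centers to each lambda.
        classifiers[(i, j)] = lambda x, a=centers[i], b=centers[j]: abs(x - b) - abs(x - a)
    return classifiers

def one_against_one_predict(pair_classifiers, x):
    """Return the class with the most votes among all pairwise classifiers."""
    votes = Counter()
    for (i, j), classifier in pair_classifiers.items():
        votes[i if classifier(x) > 0 else j] += 1
    # Ties are broken in favour of the smallest class index (an assumption).
    return max(sorted(votes), key=votes.get)
```

For `N` classes, `make_pair_classifiers` creates `N(N-1)/2` entries, which is the number of classifiers that take part in each vote.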






Hierarchies/trees of binary SVM classifiers. Here the multi-class classification
problem is decomposed into a series of binary classification sub-problems
organised in a hierarchical scheme; see Figure 2. We discuss this approach in the
next section.



(a) Binary tree classifier and (b) general hierarchical classifier, each with root node {A,B,C,D,E,F}.



Fig. 2. Two examples of hierarchical classifiers. The graphs are directed acyclic graphs 
with a single root node at the top of the graph and with terminal nodes (leaves) at the 
bottom. Individual classes are represented in the leaves, and the other nodes within 
the graph are classifiers performing a binary decision task, which is defined through 
the annotations of the incoming and the outgoing edges. 



Weston and Watkins proposed in m an extension of the binary SVM approach
to solve the $N$-class classification problem directly.

4 Classifier Hierarchies 

4.1 Confusion Classes 

One of the most important problems in multi-class pattern recognition problems 
is the existence of confusion classes. A confusion class is a subset of the set of the 
classes $\{1, \dots, N\}$ where the feature vectors are very similar and a small amount
of noise in the measured features may lead to misclassifications. For example,
in OCR the measured features for members of the classes o, O, 0 and Q are
typically very similar, so usually {o, O, 0, Q} defines a confusion class. The
major idea of hierarchical classification is first to make a coarse discrimination 
between confusion classes and then a finer discrimination within the confusion 
classes |S|. 

In Figure 2 examples of hierarchical classifiers are depicted. Each node within
the graph represents a binary classifier discriminating feature vectors of a confusion
class into one of two smaller confusion classes or possibly into individual
classes. The terminal nodes of the graph (leaves) represent these individual
classes, and the other nodes are classifiers performing a binary decision task,
thus these nodes have exactly two children. Nodes within the graph may have
more than one incoming edge. Figure 2a shows a tree-structured classifier, where
each node has exactly one incoming edge. In Figure 2b a more general classifier
structure defined through a special directed acyclic graph is depicted. In the
following we restrict our considerations to tree-structured SVMs.

4.2 Building Classifier Trees by Unsupervised Clustering 

The classification subtask is defined through the annotations of the incoming and
outgoing edges of the node. Let us consider for example the SVM classifier at the
root of the tree in Figure 2a. The label of the incoming edge is {A, ..., F}, so for
this (sub-)tree a 6-class classification task is given. The edges to the children are
annotated with {A, C, D} (left child) and {B, E, F} (right child). This means
that this SVM has to classify feature vectors into confusion class {A, C, D} or
{B, E, F}. To achieve this, all members of the six classes {A, ..., F} have to be
re-labeled: Feature vectors with class labels A, C, or D get the new label $-1$
and those with class label B, E, or F get the new label 1. After this re-labeling
procedure the SVM is trained as described in the previous section. Note that
re-labeling has to be done for each classifier training.
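
The re-labeling step for a single tree node can be sketched in a few lines. The function name `relabel_for_node` and the representation of the training set as `(feature_vector, class_label)` pairs are assumptions made for illustration; samples whose class does not belong to the node are simply skipped, which is one plausible way to handle nodes below the root.

```python
def relabel_for_node(samples, left_classes, right_classes):
    """Map original class labels to -1 (left child) or +1 (right child)."""
    relabeled = []
    for features, label in samples:
        if label in left_classes:
            relabeled.append((features, -1))
        elif label in right_classes:
            relabeled.append((features, 1))
        # Samples of classes not handled by this node are ignored.
    return relabeled
```

For the root node of the example, the call would be `relabel_for_node(samples, {"A", "C", "D"}, {"B", "E", "F"})`.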

We have not yet answered the question of how to construct this subset tree. One
approach to construct such a tree is to divide the set of classes $K$ into disjoint
subsets $K_1$ and $K_2$ utilizing clustering. In clustering and vector quantization a
set of representative prototypes $\{c_1, \dots, c_k\} \subset \mathbb{R}^d$ is determined by unsupervised
learning from the feature vectors $x^\mu$, $\mu = 1, \dots, M$ of the training set. For each
prototype $c_j$ the Voronoi cell $R_j$ and cluster $C_j$ are defined by

$$R_j := \{ x \in \mathbb{R}^d : \|c_j - x\|_2 = \min_i \|c_i - x\|_2 \}$$

and

$$C_j := R_j \cap \{ x^\mu : \mu = 1, \dots, M \}.$$

The relative frequency of members of class $i$ in cluster $j$ is

$$p_{ij} := \frac{|\Omega_i \cap C_j|}{|\Omega_i|}.$$

For class $i$ the set $\Omega_i$ is defined by

$$\Omega_i = \{ x^\mu : \mu = 1, \dots, M,\ y^\mu = i \}.$$

The $k$-means clustering with $k = 2$ cluster centers $c_1$ and $c_2$ defines a hyperplane
in the feature space separating two sets of feature vectors. From the corresponding
clusters $C_1$ and $C_2$ a partition of the classes $K$ into two subsets $K_1$
and $K_2$ can be achieved through the following assignment:

$$K_j := \{ i \in K : j = \operatorname{argmax} \{ p_{i1}, p_{i2} \} \}, \qquad j = 1, 2.$$

Recursively applied, this procedure leads to a binary tree as depicted in Figure 2a.
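
One split of the class set can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: the 2-means routine uses a simple deterministic initialization (the first point and the point farthest from it), and the relative frequency of a class in a cluster is assumed to be normalized by the class size, so that comparing raw counts per class gives the same argmax. All function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(point, centers):
    """Index (0 or 1) of the closer of two centers."""
    return 0 if squared_distance(point, centers[0]) <= squared_distance(point, centers[1]) else 1

def two_means(points, iterations=20):
    """Plain k-means with k = 2 and a deterministic initialization (an assumption)."""
    first = points[0]
    farthest = max(points, key=lambda p: squared_distance(p, first))
    centers = [list(first), list(farthest)]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for j in (0, 1):
            if clusters[j]:
                centers[j] = [sum(col) / len(clusters[j]) for col in zip(*clusters[j])]
    return centers

def partition_classes(samples, centers):
    """Assign each class to the cluster that contains most of its members."""
    counts = {}
    for features, label in samples:
        counts.setdefault(label, [0, 0])[nearest(features, centers)] += 1
    subsets = (set(), set())
    for label, (n_first, n_second) in counts.items():
        subsets[0 if n_first >= n_second else 1].add(label)
    return subsets
```

Applying `two_means` and `partition_classes` again to the samples of each resulting subset, until every subset contains a single class, would give the recursive construction of the tree.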

5 Application and Conclusion

The data set used for evaluating the performance of the classifier consists 
of 20,000 handwritten digits (2,000 samples per class). The digits, normalized
in height and width, are represented by a $16 \times 16$ matrix $(g_{ij})$ where
$g_{ij} \in \{0, \dots, 255\}$ is a value from an 8-bit gray scale (for details concerning
the data set see 0). The whole data set has been divided into a set of 10,000 
training samples and a set of 10,000 test samples. The training set has been 
used to design the classifiers, and the test set for testing the performance of the 
classifiers. 













Fig. 3. Examples of the handwritten digits.

Results for the following classifiers and training procedures are given: 

MLP: Multilayer perceptrons with a single hidden layer of sigmoidal units
(Fermi transfer function) trained by standard backpropagation; 100 training
epochs; 200 hidden units.

1NN: 1-nearest neighbour classifier.

LVQ: 1-nearest neighbour classifier trained with Kohonen’s software package
with OLVQ1 and LVQ3; 50 training epochs; 500 prototypes.

RBF: RBF networks with a single hidden layer of Gaussian RBFs trained by
standard backpropagation with 50 epochs training the full parameter set; 200
hidden units each with a single scaling parameter.

SVM-1-R: SVM with Gaussian kernel function; one-against-rest strategy; NAG
library for optimisation.

SVM-1-1: As SVM-1-R but with the one-against-one strategy.

SVM-TR: Binary tree of SVM networks. The classifier tree has been built by
$k$-means clustering with $k = 2$. In Figure 4 a representative tree is depicted
which was found by clustering experiments and which was used for the training
of the tree of SVMs.




Fig. 4. Tree of subclasses for the handwritten digits data set. 



For this data set further results may be found in the final StatLog report
(see p. 135-138 in P] and in jS|). The error rates for 1NN, LVQ, and MLP
classifiers are similar in both studies. The 1NN and LVQ classifiers perform well.
RBF networks trained by backpropagation learning and support vector learning
show the best classification performance. A significant difference between the
different multi-class classification strategies (one-against-rest, one-against-one,
and the binary SVM classifier tree) could not be found.



Table 1. Results for the handwritten digits.

Classifier    MLP    1NN    LVQ    RBF    SVM-1-R    SVM-1-1    SVM-TR
error [%]     2.41   2.34   3.01   1.51   1.40       1.37       1.39



We have presented different strategies for the $N$-class classification problem
utilising the SVM approach. In detail we have discussed a novel tree-structured
SVM classifier architecture. For the design of binary classifier trees we introduced
unsupervised clustering or vector quantisation methods. We have presented numerical
experiments on a benchmark data set. Here, the suggested SVM tree
model shows remarkable classification results which were in the range of other
classifier schemes. Similar results have been found for the recognition of 3D objects
(see 0), but further evaluation of this method in experiments with different
multi-class problems (data sets with many classes) has to be carried out.
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Abstract. We introduce a novel approach to gesture recognition, based 
on Pseudo-3D Hidden Markov Models (P3DHMMs). This technique is
capable of integrating spatially and temporally derived features in an
elegant way, thus enabling the recognition of different dynamic facial
expressions. Pseudo-2D Hidden Markov Models (P2DHMMs) have been utilized for
two-dimensional problems such as face recognition. P3DHMMs can be
considered as an extension of the 2D case, where the so-called superstates
in a P3DHMM encapsulate P2DHMMs. With our new approach
image sequences can be processed efficiently and successfully. Because
the 'ordinary' training of P3DHMMs is time-consuming and can destroy
the 3D approach, an improved training is presented in this paper. The
feasibility of using P3DHMMs is demonstrated by a number of experiments
on a person-independent database, which consists of different
image sequences of 4 facial expressions from 6 persons.



1 Introduction 

The recognition of facial expression is a research area with increasing importance 
for human computer interfaces. There are a number of difficulties due to the 
variation of facial expression across the human population and to the context- 
dependent variation even for the same individual. Even for humans, it is often 
difficult to recognize the correct facial expression (see [2]), especially on a still
image. 

An automatic facial expression recognition system can generally be decomposed
into three parts: first, the detection and location of faces in a scene; second,
the facial expression feature extraction; and third, the facial expression classification.

The first part has already been studied by many researchers and, because of
the difficulty of finding faces in a cluttered scene, it seems that the most successful
systems are based on neural networks as presented in 0. In this paper, we do
not address the face detection problem.

Facial expression feature extraction deals with the problem of finding features
that give the most appropriate representation of the face images for recognition.
There are mainly two options to deal with the images. The first is a geometric
feature-based system, as used e.g. in p. These features are robust
to variation in scale, size, head orientation and location of the face in an image,
but the computation is very expensive. The second approach for extracting features
is to deal with the entire pixel image. For this purpose one can use signal
processing methods such as the Discrete Cosine Transform (DCT) or the Gabor
transform (see P|) to obtain another representation of the image.

The facial expression classification module computes the result of the rec- 
ognized expression. In many recognition systems, neural networks are used. In 
this paper the classification part is realized by means of Hidden Markov Models 
(HMMs). A good overview is given in jOj. Because our task is to recognize facial
expressions in image sequences instead of a single picture, the use of Pseudo-3D
Hidden Markov Models (P3DHMMs) is practicable. For image recognition
Pseudo-2D Hidden Markov Models were applied with excellent results (see [3],
0). P3DHMMs have been previously utilized by Muller et al. jSj for image
sequence recognition on a crane signal database consisting of 12 different
predefined gestures.

2 Data Set and System Overview 

For the training and test database we recorded video sequences at 25 frames
per second with a resolution of 320 x 240 pixels. The camera was adjusted such
that the head was in a nearly frontal pose. The original images were rescaled
and cropped such that the eyes are roughly at the same position (resolution:
196 x 172 pixels). The database contains 96 takes of 6 persons. Each sequence
has 15 frames. The image sequence starts with a neutral facial expression and
changes to one of the 4 categories of expression, which are anger, surprise, disgust
and happiness. Fig. 1 shows the different expressions in our database. It should
be pointed out that these expressions are dynamic expressions, where
the real meaning of the expression can only be identified after the complete sequence
has been observed. The still images in Fig. 1 are frames captured mostly
at the end of each sequence. The complexity of the recognition task largely results
from the fact that each person has a different personal way of performing
each expression and that some individuals in the test database preferred a
moderate rather than an expressive way of performing the face gestures, so that
even human observers had difficulty identifying these expressions correctly
on the spot.




Fig. 1. Facial expression database: Examples for anger, surprise, disgust and happiness,
from left to right

The recognition system consists of three processing levels: preprocessing, feature
extraction and statistical classification. We achieved the best results with
the following system. Starting with the original image sequence of the face
expression, the preprocessing calculates the difference image sequence. The feature
extraction uses the DCT to produce a vector for each frame of the
difference image sequence. A P3DHMM-based recognition module classifies the
face expression, represented by the feature vector sequence. The output of the
system is the recognized expression.



2.1 Preprocessing 



The difference image sequence is calculated by subtracting the pixel value at the 
same position (x,y) of adjacent frames of the original image sequence. 

D'{x, y, t) = P{x, y, t) - P{x, y,t-l) (1) 



Thus the difference image contains positive and negative pixel values. In order 
to reduce noise we apply a threshold operation to the difference image. Every 
pixel with an absolute value smaller than the threshold is set to zero. 



\[
D(x, y, t) =
\begin{cases}
0 & : \; |D'(x, y, t)| < S \\
D'(x, y, t) & : \; |D'(x, y, t)| \geq S
\end{cases}
\tag{2}
\]



The magnitude of the grey values of the difference image in such a frame
indicates the intensity of the motion at each spatial position (x, y) of the
image. If one imagines the pixel values of the difference image as a "mountain
area" of elevation D(x, y) at point (x, y), then this mountain area can be
regarded approximately as a distribution of the movement over the image space
in x- and y-direction.
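The preprocessing of Eqs. (1) and (2) can be sketched in a few lines. This is only a minimal illustration, assuming the sequence is held as a NumPy array of shape (T, H, W); the function name and the `threshold` argument (the threshold S of Eq. (2)) are hypothetical and not taken from the original system.

```python
import numpy as np

def difference_sequence(frames, threshold):
    """Thresholded difference images of a sequence of shape (T, H, W).

    Returns an array of shape (T - 1, H, W) with signed values.
    """
    frames = np.asarray(frames, dtype=np.float64)  # signed arithmetic, no uint8 wrap-around
    diff = frames[1:] - frames[:-1]                # Eq. (1): P(x, y, t) - P(x, y, t-1)
    diff[np.abs(diff) < threshold] = 0.0           # Eq. (2): suppress small (noisy) changes
    return diff
```

A sequence of 15 frames therefore yields 14 difference images, each containing positive and negative values.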



2.2 Feature Extraction 



Each image of a sequence is scanned with a sampling window from top to bottom
and from left to right. The pixels in the sampling window of size N x N are
transformed using an ordinary DCT according to Eq. 3.



\[
C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y)
\cos\left[\frac{(2x + 1)\,u\,\pi}{2N}\right]
\cos\left[\frac{(2y + 1)\,v\,\pi}{2N}\right]
\tag{3}
\]



The sampling window is shifted so that it overlaps the previous window position
to a certain degree. To reduce the size of the feature vector, only the first
few coefficients, located in the upper-left triangle of the transformed image
block, are taken, because these coefficients contain the most important
information of the image.
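The block-wise feature extraction can be illustrated as follows. This is a sketch under stated assumptions, not the original implementation: the normalization α is taken to be the orthonormal DCT-II scaling (the text does not define it), the triangle is traversed diagonal by diagonal, and the defaults (16 x 16 block, 75% overlap, 10 coefficients) are the values reported in Section 3. All function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Row u holds alpha(u) * cos((2x + 1) * u * pi / (2n)) for x = 0..n-1."""
    x = np.arange(n)
    u = x.reshape(-1, 1)
    basis = np.cos((2 * x + 1) * u * np.pi / (2 * n))
    alpha = np.full(n, np.sqrt(2.0 / n))   # assumed orthonormal scaling
    alpha[0] = np.sqrt(1.0 / n)
    return alpha[:, None] * basis

def triangle_indices(n_coeffs):
    """Index pairs (u, v) of the upper-left triangle, ordered by u + v."""
    pairs, s = [], 0
    while len(pairs) < n_coeffs:
        pairs.extend((u, s - u) for u in range(s + 1))
        s += 1
    return pairs[:n_coeffs]

def block_features(image, block=16, overlap=0.75, n_coeffs=10):
    """Scan the image top to bottom, left to right; return (rows, cols, n_coeffs)."""
    image = np.asarray(image, dtype=np.float64)
    step = max(1, int(round(block * (1.0 - overlap))))
    d = dct_matrix(block)
    idx = triangle_indices(n_coeffs)
    rows = []
    for top in range(0, image.shape[0] - block + 1, step):
        row = []
        for left in range(0, image.shape[1] - block + 1, step):
            c = d @ image[top:top + block, left:left + block] @ d.T  # 2-D DCT, Eq. (3)
            row.append([c[u, v] for u, v in idx])
        rows.append(row)
    return np.array(rows)
```

With a 16 x 16 window and 75% overlap the window advances by 4 pixels, and 10 coefficients fill exactly the first four diagonals of the triangle.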

2.3 Statistical Classification

The recognition of image sequences can be considered a 3-dimensional pattern
recognition problem, where the first 2 dimensions are given by the horizontal
and vertical components of each image frame and the 3rd dimension is time.
Owing to very encouraging results we recently obtained on related problems in
this area (see |S1), we favored the use of Pseudo-3D Hidden Markov Models for
this demanding task. P3DHMMs can be considered a very new paradigm in image
sequence processing and have been pioneered mainly by our research group for
image sequence recognition problems in which time-varying as well as static
information is of crucial importance for the recognition performance. Facial
expression recognition falls exactly into this category, for the following
reasons. As mentioned previously, one possible approach to facial expression
recognition would be to localize facial cues such as eyebrows or lips and to
use these parameters as descriptors for typical expressions. First, this would
have led to enormous problems in detecting the correct cues (especially in
person-independent mode); second, it would have been very difficult to evaluate
a dynamic sequence of these cues. By investigating the entire face image, as is
done in our approach, basically all face areas are used as cues, and we cannot
miss any of them, either in defining or in detecting them. On the other hand,
processing the entire face carries the danger that the system degenerates into
a face recognition system instead of a facial expression recognition system. An
unknown template might, for instance, be assigned to a reference pattern not
because it contains the right expression but because the overall face is close
to the face represented in the reference pattern. Additionally, the facial
expression can only be identified correctly if the change over time is taken
into account. Therefore, timing as well as static information (in our case the
cues implicitly included in an image frame) has to be evaluated for
high-performance facial expression recognition, and exactly this is our
motivation for using P3DHMMs. Fig. 2 shows the principal structures of P2DHMMs
and P3DHMMs.

HMMs are finite non-deterministic state machines with a fixed number of states
N, associated output density functions b, and transition probabilities
described by an N x N transition matrix a_ij.




Fig. 2. General structures of P2DHMM and P3DHMM 
So an HMM $\lambda(\pi, a, b)$ is fully described by

\[
\begin{aligned}
S &= \{s_1, s_2, \ldots, s_N\} && \text{State set} \\
V &= \{v_1, v_2, \ldots, v_M\} && \text{Possible output set} \\
a_{ij} &= P[q_{t+1} = s_j \mid q_t = s_i], \quad 1 \le i, j \le N && \text{State transition probability} \\
b_i(k) &= P[v_k \text{ at } t \mid q_t = s_i], \quad 1 \le i \le N,\ 1 \le k \le M && \text{Output density function} \\
\pi_i &= P[q_1 = s_i], \quad 1 \le i \le N && \text{Initial start distribution}
\end{aligned}
\tag{4}
\]

where $q_t$ denotes the actual state at time t. After training the different
models $\lambda$ with the Baum-Welch algorithm, the probability
$P(O \mid \lambda)$ for a given observation sequence $O = O_1 O_2 \cdots O_T$
(with $O_i \in V$) can be evaluated with the Viterbi algorithm. The observation
is assigned to the model with the highest probability. For a detailed
explanation see |^.

A P2DHMM is an extension of the one-dimensional HMM paradigm, which has been
developed in order to model two-dimensional data. The state alignments of
adjacent columns are calculated independently of each other, which is why the
model is called pseudo 2-dimensional. The states in horizontal direction are
called superstates, and each superstate encapsulates a one-dimensional HMM in
vertical direction. By applying this encapsulation principle a second time to
the 2D HMMs, we obtain the P3DHMM structure with superstates that can be
interpreted as start-of-image states. Each superstate now consists of a
P2DHMM. Samaria shows in 0, that a P2DHMM can be transformed into an equivalent
one-dimensional HMM by inserting special start-of-line states and features.
Consequently, it is also possible to transform P3DHMMs into 1D linear HMMs,
which is crucial for the feasibility of the training and recognition process
associated with HMMs.

Fig. 3 shows an equivalent 1D HMM of a 4 x 3 x 2 P3DHMM with start-of-line
states. These states generate a high probability for the emission of
start-of-line features. To distinguish the start-of-line features from all
possible ordinary features, they have to be set to completely different values.
The feature vectors also have to be edited so that the assignment to the states
works correctly. Values for the superstates have to be inserted after each line
for the start-of-line states and after each picture for the start-of-image
states. The major advantage of this transformation is that these models can be
trained with the standard Baum-Welch algorithm and that the well-known Viterbi
algorithm can be used for recognition.

Fig. 3. Augmented 4 x 3 x 2 P3DHMM with transitions (arrows), states (circles)
and superstates (crossed)
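The editing of the feature vectors described above can be sketched as follows. This is a hedged illustration only: the marker values `START_OF_LINE` and `START_OF_IMAGE` are hypothetical constants chosen to lie far outside the range of ordinary features, and the input layout (T, lines, blocks per line, dim) is an assumption about how the block features are stored.

```python
import numpy as np

START_OF_LINE = 1.0e3    # hypothetical marker value, far from ordinary features
START_OF_IMAGE = -1.0e3  # hypothetical marker value, distinct from the line marker

def linearize(sequence):
    """Flatten features of shape (T, lines, blocks, dim) into a 1-D sequence (L, dim).

    A marker vector is inserted after each line and after each picture, so that
    an equivalent one-dimensional HMM can align its special states to them.
    """
    sequence = np.asarray(sequence, dtype=np.float64)
    dim = sequence.shape[-1]
    out = []
    for frame in sequence:
        for line in frame:
            out.extend(line)                          # ordinary features of one line
            out.append(np.full(dim, START_OF_LINE))   # marker after each line
        out.append(np.full(dim, START_OF_IMAGE))      # marker after each picture
    return np.vstack(out)
```

The resulting sequence can then be passed to any standard one-dimensional HMM training or decoding routine.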



3 Experiments and Results 

The recognition system was tested with different sizes for the HMMs and
different feature extraction parameters. The best results were achieved with a
DCT block of 16 x 16 pixels and 75% overlap. The feature vectors consisted of
the first 10 coefficients of the DCT block matrix. For the recognition, four
superstates with a (4 x 4) P2DHMM per superstate were used as the configuration
of the P3DHMMs. A single P3DHMM is trained for each face expression, using 3
sequences from each of the 6 different persons.

Because the full training of P3DHMMs is very time-consuming, a novel improved
training approach is introduced here. Since a P3DHMM contains P2DHMMs, the
training of the models can be broken down into two parts. First, all image
sequences representing one facial expression are subdivided into sections,
more or less equally distributed over the length of the sequence. Since our
P3DHMM contains 4 P2DHMMs, we use 4 sections. One can expect the outcome of the
P3DHMM learning procedure to be that roughly each quarter of the sequence is
assigned to the corresponding P2DHMM of the 4 superstates in the P3DHMM; each
isolated P2DHMM was therefore first trained separately on the corresponding
quarter of the image sequence. Afterwards the trained P2DHMMs were combined
into a P3DHMM. This can be done very easily: the states of the P2DHMMs are
inserted between the start-of-image states. Only the transitions have to be
modified; in particular, the transition to the end state of each P2DHMM has to
be redirected to the next start-of-image state and thus to the beginning of the
next P2DHMM. After this initialization, the P3DHMM is trained with the complete
sequences. The total time saving is about 50% of the normal training time.
Another advantage of this kind of training is that the time structure is
initialized in the P3DHMM.
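The first part of this procedure, the subdivision of every sequence into four roughly equal sections, might look as follows. It is a sketch only: the function names are hypothetical, and the actual P2DHMM training that would consume each section set is not shown.

```python
import numpy as np

def split_into_sections(sequence, n_sections=4):
    """Split one sequence along the time axis into nearly equal consecutive sections."""
    return np.array_split(np.asarray(sequence), n_sections, axis=0)

def section_training_sets(sequences, n_sections=4):
    """Collect, for each section index, the matching part of every sequence.

    Entry k of the result is the training material for the k-th P2DHMM.
    """
    sets = [[] for _ in range(n_sections)]
    for sequence in sequences:
        for k, part in enumerate(split_into_sections(sequence, n_sections)):
            sets[k].append(part)
    return sets
```

For the 14 difference images obtained from a 15-frame sequence, the four sections contain 4, 4, 3 and 3 frames.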

Table 1 shows the recognition rates achieved in the experiments and displays
the improvement obtained through the use of multiple Gaussian mixtures. For
this, the probability density function becomes a sum of several Gaussian
mixtures, so that the feature vectors are divided into statistically
independent streams. Here too, the training could be done faster if the
contained P2DHMMs were mixed up separately before the combined P3DHMM is
trained. Given the difficulty of the task described in Section 2, we believe
that the person-independent recognition of the facial expressions with an
accuracy of close to 90% is already a very respectable result. We expect even
further improvements in the near future.

Table 1. Recognition rates achieved in the experiments

rec. rate    1 Mixture   2 Mixtures   3 Mixtures   4 Mixtures   5 Mixtures
on rank 1    70.83%      75.00%       79.17%       87.50%       87.50%
on rank 2    79.17%      87.50%       87.50%       95.83%       95.83%
on rank 3    95.83%      100.00%      100.00%      100.00%      100.00%



4 Summary 

Facial expression recognition based on pseudo three-dimensional Hidden Markov
Models has been presented. This modeling technique achieved very good
recognition rates with simple preprocessing methods, owing to the powerful
classification capabilities of the novel P3DHMM paradigm. The problem of the
time-intensive training of the P3DHMMs has been solved elegantly by our new
initialization and partial training techniques. The advantage of P3DHMMs lies
mainly in the integration of spatially and temporally derived features and in
their joint warping capabilities. We therefore strongly believe that this novel
approach has clear advantages over other, more traditional approaches in a
special area such as facial expression recognition, where exactly the
capabilities provided by P3DHMMs are needed. Our results seem to confirm this
assumption.

Abstract. We present a photogrammetric system for on-line pose measurement of
a robot. The system is based on photogrammetric measurement techniques, namely
resection. We describe the theoretical foundations of our approach as well as
early details of our implementation and hardware set-up. The results achieved
are compared to those of a commercial ball-bar system.



1 Introduction 

While industrial robots can typically achieve a repeatability of 0.1 mm or
better, their absolute accuracy can be in the range of only 1 mm or worse. Many
new applications in robotics, including flexible optical measurement, require
improved absolute accuracy. Photogrammetry has been used for several years to
perform off-line calibration in order to increase accuracy. However, when a
robot is operated under shop-floor conditions, the accuracy is expected to
decrease again over time. Clearly, an on-line pose correction is desirable. We
present a system for on-line pose measurement based on photogrammetry.



2 Photogrammetric System 



Photogrammetric measurement is based on the classical collinearity equations,
which state the condition that the projection center, a point in object space
and its corresponding point on the image plane lie on a straight line. Let c be
the principal distance of the camera, (X_c, Y_c, Z_c) the projection center,
(X, Y, Z) the coordinates of the object point and (x, y) the coordinates of the
corresponding image point. Then the collinearity equations are (see 0)

\[
\begin{aligned}
x &= x_0 - c\,\frac{(X - X_c)\,r_{11} + (Y - Y_c)\,r_{12} + (Z - Z_c)\,r_{13}}
                   {(X - X_c)\,r_{31} + (Y - Y_c)\,r_{32} + (Z - Z_c)\,r_{33}}
   = x_0 - c\,\frac{Z_x}{N} \\
y &= y_0 - c\,\frac{(X - X_c)\,r_{21} + (Y - Y_c)\,r_{22} + (Z - Z_c)\,r_{23}}
                   {(X - X_c)\,r_{31} + (Y - Y_c)\,r_{32} + (Z - Z_c)\,r_{33}}
   = y_0 - c\,\frac{Z_y}{N}
\end{aligned}
\]



with the condition that the r_ij are the elements of a 3-D orthogonal rotation
matrix. For many applications in photogrammetry the rotation matrix R is
parameterized using Cardan angles ω, φ, κ. However, it has been shown to be
favorable (see |1 flj l) to use a parameterization based on quaternions. We use
Hamilton normalized quaternions to describe a rotation matrix:

\[
R = \ldots, \qquad
S = \begin{pmatrix} 0 & -c & b \\ c & 0 & -a \\ -b & a & 0 \end{pmatrix}
\]


2.1 Camera Model 

While the basic camera model in photogrammetry is the pin-hole camera,
additional parameters are used for a more complete description of the imaging
device. The following parameters are based on the physical model of D. C. Brown
(P). We follow the notation for digital cameras presented by C. S. Fraser
([5]). Three parameters K_1, K_2 and K_3 are used to describe the radial
distortion. Two parameters P_1 and P_2 describe the decentring distortions. Two
parameters b_1 and b_2 describe a difference in scale between the x- and y-axis
of the sensor and shearing. To obtain the corrected image coordinates (x, y),
the parameters are applied to the distorted image coordinates (x', y') as
follows:

\[
\begin{aligned}
\bar{x} &= x' - x_0 \\
\bar{y} &= y' - y_0 \\
\Delta x &= \bar{x} r^2 K_1 + \bar{x} r^4 K_2 + \bar{x} r^6 K_3
          + (2\bar{x}^2 + r^2) P_1 + 2 P_2 \bar{x}\bar{y} + b_1 \bar{x} + b_2 \bar{y} \\
\Delta y &= \bar{y} r^2 K_1 + \bar{y} r^4 K_2 + \bar{y} r^6 K_3
          + 2 P_1 \bar{x}\bar{y} + (2\bar{y}^2 + r^2) P_2 \\
x &= \bar{x} + \Delta x \\
y &= \bar{y} + \Delta y
\end{aligned}
\]

where (x_0, y_0) is the principal point and $r = \sqrt{\bar{x}^2 + \bar{y}^2}$
is the radial distance from the principal point. The camera's parameters are
determined in a bundle adjustment using a planar test field. The bundle
adjustment is carried out beforehand.
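The correction above translates directly into code. The following is a minimal sketch of the formulas as written in this section; the function name and the grouping of the parameters into tuples are hypothetical conveniences, and the numerical values in the usage note are illustrative only.

```python
import numpy as np

def correct_image_point(xp, yp, x0, y0, K, P, b):
    """Apply radial, decentring and affinity/shear corrections to (x', y').

    K = (K1, K2, K3), P = (P1, P2), b = (b1, b2); works on scalars or arrays.
    """
    K1, K2, K3 = K
    P1, P2 = P
    b1, b2 = b
    xb = np.asarray(xp, dtype=np.float64) - x0     # coordinates relative to the
    yb = np.asarray(yp, dtype=np.float64) - y0     # principal point
    r2 = xb ** 2 + yb ** 2                         # squared radial distance
    radial = K1 * r2 + K2 * r2 ** 2 + K3 * r2 ** 3
    dx = xb * radial + (2 * xb ** 2 + r2) * P1 + 2 * P2 * xb * yb + b1 * xb + b2 * yb
    dy = yb * radial + 2 * P1 * xb * yb + (2 * yb ** 2 + r2) * P2
    return xb + dx, yb + dy
```

With all distortion parameters set to zero the function simply shifts the coordinates to the principal point; with K_1 = 0.01 only, a point at x' = 2, y' = 0 (principal point at the origin) moves to x = 2.08.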



2.2 Resection 

The problem of (spatial) resection involves the determination of the six param- 
eters of the exterior orientation of a camera station. We use a two-stage process 
to solve the resection problem. A closed-form solution using four points gives 
the initial values for an iterative refinement using all control points. 



Closed-form Resection. Several alternatives for a closed-form solution to the
resection problem have been given in the literature. We follow the approach
suggested by Fischler et al. j^. Their algorithm, named the "Perspective 4
Point Problem", solves for the three unknown coordinates of the projection
center when the coordinates of four control points lying in a common plane are
given. Because the control points are all located on a common plane, the
mapping between image and object points is a simple plane-to-plane
transformation T. The location of the projection center can be extracted from
this transformation T when the principal distance of the camera is known. For a
detailed description of the formulas please refer to the original publication.

To complete the solution of the resection problem we also need the orienta- 
tion of the camera in addition to its location. Kraus jS) gives the solution for 
determining the orientation angles when the coordinates of the projection center 
are already known. The algorithm makes use of only three of the four points. 



Iterative Refinement. The closed-form solution makes use of only four control
points. Usually many more control points are available, and the solution is
more accurate if all observations are used in a least squares adjustment. For
an iterative solution, the collinearity equations have to be linearized. This
is standard in photogrammetry. While most of the partial derivatives of the
classical bundle adjustment remain the same, the elements for rotation have
changed because of our use of quaternions. Six new partial derivatives replace
the common
K K 

duj ’ 

I = + cZ^Zy - aZyN - N^) 

I = + bZyN + cN^) 

I = ^i-aZl - hZ^Zy + ZyN - aN^) 
f = ^i-Z.Zy + aZ^N + cZl + cN^) 
g = ^{cZ^Zy - bZ^N + Zl + TV") 
i = ^{-aZ^Zy - Z^N - bZl - bN^) 

Using the results from the closed-form solution as described above, we can 
now iteratively refine the solution for all control points available. 
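The iterative refinement can be illustrated with a small Gauss-Newton loop. This is a sketch under several assumptions and not the implementation described here: the rotation is built from the three parameters a, b, c via the Cayley transform R = (I - S)^{-1}(I + S), which is one common way to use such a skew-symmetric matrix S; the Jacobian is approximated by central differences instead of the analytic partial derivatives; and lens distortion is ignored. All function names are hypothetical.

```python
import numpy as np

def rotation(a, b, c):
    """Rotation matrix from three parameters (assumed Cayley form)."""
    S = np.array([[0.0, -c, b], [c, 0.0, -a], [-b, a, 0.0]])
    I = np.eye(3)
    return np.linalg.solve(I - S, I + S)

def project(pose, points, c_dist, x0=0.0, y0=0.0):
    """Collinearity equations; pose = (Xc, Yc, Zc, a, b, c), points of shape (n, 3)."""
    R = rotation(*pose[3:])
    d = points - pose[:3]
    zx, zy, n = d @ R[0], d @ R[1], d @ R[2]       # numerators Z_x, Z_y, denominator N
    return np.column_stack((x0 - c_dist * zx / n, y0 - c_dist * zy / n))

def refine_pose(pose0, points, observed, c_dist, iterations=20, eps=1e-6):
    """Least squares refinement of the six exterior orientation parameters."""
    pose = np.array(pose0, dtype=np.float64)
    for _ in range(iterations):
        residual = (observed - project(pose, points, c_dist)).ravel()
        J = np.empty((residual.size, 6))
        for k in range(6):                         # numerical Jacobian, column by column
            step = np.zeros(6)
            step[k] = eps
            J[:, k] = (project(pose + step, points, c_dist)
                       - project(pose - step, points, c_dist)).ravel() / (2 * eps)
        delta = np.linalg.lstsq(J, residual, rcond=None)[0]
        pose += delta
        if np.linalg.norm(delta) < 1e-12:
            break
    return pose
```

In practice `pose0` would come from the closed-form solution, and `points` would hold all identified control points of the calibration plate.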



Simulated Results. Using the formulas given above and applying error
propagation, we can simulate the resection to predict the expected errors.
Assuming a focal length of 12 mm and a standard pixel size of 6.7 µm, the
expected errors in x, y and z for a certain error in image measurement are
given in Table 1.

3 Implementation 

The algorithms described above were implemented in an on-line pose measurement
system. The system's components, both mechanical and electronic, are described
below. In addition, the test configuration using an industrial robot is
described.



3.1 Hardware 

The optical sensor we are using is a Basler A113 camera with a Sony CCD chip,
which has a resolution of 1300 x 1030 pixels. The camera provides a digital
output according to IEEE standard RS644. A frame-grabber is integrated into a
standard PC. The digital camera system is superior to an analog camera system
since it does not exhibit line jitter problems, thus enabling more precise
measurements. Schneider-Kreuznach lenses with 12 mm focal length are mounted
onto the camera.

To maximize the signal intensity we use retro-reflective targets and a ring
light on the camera. To minimize the effects of external light sources we use
near-infrared LEDs as light source and a filter on the lens (see figure 1,
left).

The control points are fixed onto a plane, which is also used as a calibration
plate. We use coded targets to achieve automated identification of the targets.
The circular center of the target is used for sub-pixel precise measurement of
the target's position. A circular code surrounding the center carries a binary
representation of the unique identifier (see figure 1, center).

The set-up for our experiments consists of a Kuka KR15 robot. It is a six-axis
robot with a maximum payload of 15 kg at a maximum range of 1570 mm. The robot
is specified with a repeatability of ±0.1 mm. The absolute accuracy is not
specified.




Fig. 1. Left: camera and ring LED mounted onto the robot. Center: part of the test 
field. Right: Ball-bar tester. 



3.2 Image Processing 

Image processing is performed on a standard PC. The images are first binarized
using an optimal thresholding algorithm introduced by Otsu (see 0). Then a
connected component labeling is performed to find targets. We discriminate the
circular centers from all other 'blobs' by size and a simple shape test. Since
the targets can be smaller than 10 pixels in diameter, we do not perform
ellipse fitting (see 0). The center of the target is computed as the weighted
centroid.
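The two central steps of this paragraph, Otsu thresholding and the weighted centroid, can be sketched with NumPy alone. This is an illustrative sketch assuming 8-bit grey-value images and a single blob per mask; the connected component labeling and the size and shape tests are omitted, and the function names are hypothetical.

```python
import numpy as np

def otsu_threshold(image):
    """Threshold t maximizing the between-class variance of an 8-bit image.

    Pixels with value > t are treated as foreground.
    """
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    omega = np.cumsum(p)                           # class probability up to t
    mu = np.cumsum(p * np.arange(256))             # cumulative mean up to t
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)
    valid = denom > 1e-12                          # skip empty classes
    sigma_b[valid] = (mu[-1] * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

def weighted_centroid(image, mask):
    """Grey-value weighted centroid (row, column) of the pixels selected by mask."""
    rows, cols = np.nonzero(mask)
    w = image[rows, cols].astype(np.float64)
    return (rows * w).sum() / w.sum(), (cols * w).sum() / w.sum()
```

Because every foreground pixel contributes in proportion to its grey value, the centroid is located with sub-pixel resolution even for blobs only a few pixels wide.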

In addition the elliptic axes and the pose angle are computed. The center, 
the axes and the angle are used to determine the code ring. The code ring is 
read with six-times oversampling. The acyclic redundant code provides unique 
identifiers for up to 352 targets. 
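The figure of 352 identifiers is consistent with a ring code that must be readable from any starting angle. As an illustration, and purely as an assumption since the number of code bits is not stated here, a ring of 12 binary segments has exactly 352 codes that are distinct under cyclic rotation. The sketch below enumerates one canonical representative per rotation class; the function name is hypothetical.

```python
def rotation_invariant_codes(bits):
    """One canonical representative per class of cyclically equivalent bit patterns."""
    mask = (1 << bits) - 1
    codes = []
    for value in range(1 << bits):
        # All cyclic rotations of the bit pattern within `bits` bits.
        rotations = [((value << k) | (value >> (bits - k))) & mask for k in range(bits)]
        if value == min(rotations):                # keep the smallest rotation only
            codes.append(value)
    return codes
```

A decoder can apply the same idea in reverse: the bit pattern read from the ring at an arbitrary start angle is rotated to its minimum, which then serves as the target identifier.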

In addition to the coded targets, much smaller uncoded targets were added to
the test field. After the closed-form resection has been computed using the
coded targets, the approximate positions of these targets can be projected onto
the image, since their three-dimensional locations are known from calibration.
Thus these additional targets can easily be identified through their position.



3.3 Results 

The sensor delivers a frame rate of about 12 frames per second. The implemented
system is capable of processing a single image in 420 ms. A typical image
contains about 30 coded and about 200 uncoded targets. This gives us a
processing speed of about 500 targets per second, including all image
processing steps and the resection. The standard deviations obtained in a first
test run are given in Table 1.



Table 1. Standard deviations of resection.

standard deviations    simulation               test-run
image measurement      ^ pixel      pixel
resection in x         0.06 mm      0.03 mm     0.2 mm
resection in y         0.06 mm      0.03 mm     0.2 mm
resection in z         0.02 mm      0.009 mm    0.06 mm



4 Circular Test 

ISO 230-4 jO] describes the "Circular tests for numerically controlled machine
tools". While these tests were originally designed for the simultaneous
movement of only two axes, they also have valid implications for other
machines. When the test is carried out, the robot performs a circular motion
and a measurement system detects any deviation from the ideal path. The
parameters of the test include

1. diameter of the nominal path 

2. contouring feed 

3. machine axes moved to produce the actual path 

The results of the test include the radial deviation F, which is defined as the 
deviation between the actual path and the nominal path, where the center of the 
nominal path can be determined either by manual alignment or by least squares 
adjustment. 

4.1 Ball-Bar System 

There exist several commercial systems to perform the circular test. We chose
the Renishaw ball-bar system as a reference system. The system consists mainly
of a linear displacement sensor which is placed between two ball-mounts, one
near the tool center point (TCP) of the machine and one in the center of the
circular path. The displacement sensor continually measures the distance
between the two balls and thereby measures any deviations in the radius of the
circular path (see figure 1, right).

The device has a length of 150 mm, a measurement resolution of 0.1 µm, an
approximate accuracy of 1 µm and a measurement range of ±1 mm. It is able to
measure the deviation at a frequency of 300 Hz.

For the actual test we programmed a circular path with a radius of 149.9 mm to
ensure that the system stays within its measurement range. The contouring feed
was 10 mm/s. Because the TCP orientation was programmed to be constant, all six
axes of the robot were moved to produce the path.

4.2 Photogrammetric Test 

For the test of our own system we used exactly the same circular path of the
robot. The calibration plate was placed on the floor, with the camera looking
down onto it. The circular path, the calibration plate and the image plane were
all approximately parallel to each other. Our on-line system continuously
records the six parameters of the exterior orientation. The projection center
coordinates are used in a least squares adjustment of the circular path to
determine the center and radius of the ideal path, as suggested by ISO 230-4.
The deviation of the measured coordinates of the projection center from the
ideal path is the radial deviation. The deviations in the rotation angle are
also recorded. This feature is not available in the ball-bar test.
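The evaluation of the circular test can be sketched as follows. The sketch uses a simple algebraic least squares circle fit, which is a swapped-in technique for illustration and not necessarily the adjustment used in our system or prescribed by ISO 230-4; it also assumes that the recorded projection centers have already been reduced to two coordinates in the plane of the path. The function names are hypothetical.

```python
import numpy as np

def fit_circle(xy):
    """Algebraic least squares circle fit; returns (center_x, center_y, radius).

    Solves x^2 + y^2 = 2*cx*x + 2*cy*y + k with k = r^2 - cx^2 - cy^2.
    """
    xy = np.asarray(xy, dtype=np.float64)
    A = np.column_stack((2 * xy[:, 0], 2 * xy[:, 1], np.ones(len(xy))))
    rhs = (xy ** 2).sum(axis=1)
    (cx, cy, k), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return cx, cy, np.sqrt(k + cx ** 2 + cy ** 2)

def radial_deviation(xy, cx, cy, radius):
    """Signed distance of each measured point from the fitted (ideal) circle."""
    xy = np.asarray(xy, dtype=np.float64)
    return np.hypot(xy[:, 0] - cx, xy[:, 1] - cy) - radius
```

Positive values indicate points outside the ideal path, negative values points inside it; plotting them against the path angle gives a diagram of the kind used in circular tests.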

4.3 Results 

Figure 2 shows the results of the circular test. We see that currently we are
unable to achieve the same accuracies as the commercial ball-bar system. The
deviations in rotation mentioned above are significant, since the camera is
mounted onto the robot at a certain distance (≈ 100 mm) from the TCP. Therefore
the deviations in the projection center represent not only deviations in the
position of the TCP but also in its rotation. We can clearly see from figure
2(c) how extreme deviations in orientation correspond to extreme deviations in
position. Since the ball-mount of the ball-bar tester is located much closer
(≈ 10 mm) to the true TCP, it is less sensitive to errors in rotation.

5 Summary 

The implemented system is an improvement of our off-line system published
earlier 0. It has proven to be quite flexible, and we believe it can easily be
integrated into many applications in robotics, especially applications in
optical measurement. Currently the accuracies do not compare well to those of a
commercial ball-bar system. However, the deviations are mostly due to an error
in the robot's TCP orientation. For future work, the deviations measured at the
projection center should be re-transformed to the TCP using the known hand-eye
calibration matrix. To achieve on-line pose correction, the obtained pose
information has to be passed directly to the robot control unit.

Fig. 2. Results of the circular test: (a) for the ball-bar system, the movement
of the ball-mount center is shown, and (b) for the photogrammetric system, the
movement of the camera's projection center is shown, in mm. Deviations are
magnified by a factor of 25. (c) Angular deviation and radial deviation.

Abstract. In this paper we introduce a formalism for optimal camera parameter
selection for iterative state estimation. We consider a framework based on Shan- 
non’s information theory and select the camera parameters that maximize the 
mutual information, i.e. the information that the captured image conveys about 
the true state of the system. The technique explicitly takes into account the a priori 
probability governing the computation of the mutual information. Thus, a sequen- 
tial decision process can be formed by treating the a posteriori probability at the 
current time step in the decision process as the a priori probability for the next 
time step. The convergence of the decision process can be proven. 

We demonstrate the benefits of our approach using an active object recognition 
scenario. The results show that the sequential decision process outperforms a ran- 
dom strategy, both in the sense of recognition rate and number of views necessary 
to return a decision. 



1 Introduction 

State estimation from noisy image data is one of the key problems in computer
vision. Besides the inherent difficulty of developing a state estimator that
returns decent results in most situations, one important question is whether we
can optimize state estimation by choosing the right sensor data as input. It is
well known that the chosen sensor data has a big influence on the resulting
state estimate. This general relationship has been discussed in detail in
dozens of papers in the area of active vision, where the main goal was to
select the right sensor data to solve a given problem.

In our paper we tackle the problem of optimal sensor data selection for state
estimation by adjusting the camera parameters. The optimal camera parameters
are found by minimizing the uncertainty and ambiguity in the state estimation
process, given the sensor data. We will present a formal information theoretic
metric for this informal characterization later on. We do not restrict the
approach to acquiring sensor data once. The approach cycles through an action
selection and a sensor data acquisition stage, where the sensor data selected
depends on the state estimation up to the current time step. One important
property of the proposed sequential decision process is that its convergence
can be proven and that it is optimal in the sense of the reduction in
uncertainty and ambiguity. We will demonstrate our approach in an active object
recognition scenario.

The general problem of optimal sensor data acquisition has been discussed
before. Examples can be found in the area of active robot localization H and
active object recognition II , where a similar metric has been used, but the
sequential implementation is missing. The most closely related approach, not
only from the application point of view but also from a theoretical point of
view, is the work of Ol. The commonalities, differences and improvements with
respect to this work are discussed later on. Similarities can also be found to
the work of iBHOll , where a Bayesian approach [0 as well as an approach using
reinforcement learning fTOII has been presented for optimally selecting
viewpoints in active object recognition. Our approach can be seen as a
theoretically justifiable extension of this work. Interesting related work from
the area of control theory is EQ.

The paper is structured as follows. In the next section we describe the problem
in a formal manner and introduce the metric that is optimized during one step
of the sequential decision process. In Section 3 we build up the sequential
decision process and give a sketch of the convergence proof, which can be found
in detail in Q. The active object recognition scenario is described in Section
4. The experimental results are summarized in Section 5. The paper concludes
with a summary and an outlook to future work.



2 Formal Problem Statement 



Fig. 1. General principle: reduce uncertainty and ambiguity (variance and
multiple modes) in the pdf of the state x_t by choosing appropriate
information-acquisition actions a_t.



The problem and the proposed solution are depicted in Fig. 1. A sequence of
probability density functions (pdf) p(x_t), x_t ∈ S, over the state space S is
shown. The sequence starts with a uniform distribution, i.e. nothing is known
about the state of the system. Certain actions a_t are applied that select the
sensor data at time step t. The subsequent state estimation process results in
a new probability distribution p(x_{t+1}) over the state space. Finally, after
n steps one should end up with a unimodal distribution with small variance and
the mode close to the true state of the system. The problem now is twofold:
first, how do we measure the success of a chosen action, i.e. how close the pdf
is to a unimodal distribution with small variance? And second, how do we
compute the action that brings us closer to such a distribution?

The first question can be answered by using information theoretic concepts. In
information theory the entropy of a pdf,

\[ H(x_t) = - \int_{x_t} p(x_t) \log\big(p(x_t)\big) \, dx_t , \]

is defined, which measures the amount of uncertainty in the outcome of a random
experiment. The more unpredictable the outcome, the larger the entropy is. It
reaches its maximum for a uniform pdf and its minimum at zero for a delta
function, i.e. for an unambiguous outcome.

The answer to the second question can also be found in information theory.
Assume the following setting: the system is in state x_t. The state itself
cannot be observed, but an observation o_t is available that is related to the
state by a pdf p(o_t | x_t, a_t). The pdf is also conditioned on the action
a_t. In information theory, the concept of mutual information (MI) gives us a
hint as to which action a_t should be chosen. The MI

\[
I(x_t; o_t \mid a_t) = \int_{x_t} \int_{o_t} p(x_t)\, p(o_t \mid x_t, a_t)
\log \frac{p(o_t \mid x_t, a_t)}{p(o_t \mid a_t)} \, do_t \, dx_t
\tag{1}
\]

is the difference between the entropy H(x_t) and the conditional entropy
H(x_t | o_t, a_t). It describes how much the uncertainty about the true state
x_t is reduced in the mean after the observation. Since we introduced the
dependency on the action a_t, we can influence the reduction in uncertainty by
selecting the action a* that maximizes the MI:

\[ a^* = \operatorname*{argmax}_{a_t} I(x_t; o_t \mid a_t) . \tag{2} \]

All we need is the likelihood function p(o_t | x_t, a_t) and the a priori
probability p(x_t).
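For a finite number of states and observations the integrals become sums, and the selection rule can be illustrated in a few lines. This is a sketch under that discretization assumption; the function names and the array layout (one likelihood table per action) are hypothetical.

```python
import numpy as np

def mutual_information(prior, likelihood):
    """Discrete MI between state and observation for one action.

    prior: p(x), shape (n_states,); likelihood: p(o | x, a), shape (n_states, n_obs).
    """
    joint = prior[:, None] * likelihood            # p(x) p(o | x, a)
    evidence = joint.sum(axis=0)                   # p(o | a)
    ratio = np.ones_like(joint)                    # log(1) = 0 where the joint vanishes
    np.divide(likelihood, evidence[None, :], out=ratio, where=joint > 0)
    return float((joint * np.log(ratio)).sum())

def best_action(prior, likelihoods):
    """Index of the action with maximum MI, together with all MI values."""
    scores = [mutual_information(prior, lk) for lk in likelihoods]
    return int(np.argmax(scores)), scores
```

An action whose observations are independent of the state yields an MI of zero, whereas an action whose observations identify the state uniquely yields the full entropy of the prior.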

In im a similar approach has been proposed in an active object recognition
application, with the exception that the a priori information has been assumed
to be uniform in any case. In the next section we extend this approach to a
sequential decision process whose convergence can be proven. The important
difference is that we explicitly take into account the inherently changing
prior. The prior changes, since new sensor data changes the information
available about the true state.



3 Optimal Iterative Sensor Data Selection 

From the previous section we know which action a* to select to get the sensor data Ot 
that best reduces the uncertainty in the state estimation. From the definition of MI it is 
Fig. 2. Sequential decision process of maximum mutual information (MMI) for camera parameter
selection and Bayesian update of p(x | o, a) based on the observed feature o.



obvious that the reduction will only be reached in the mean. As a consequence there might
be observations under action a_t that result in an increase of the uncertainty. Another,
more serious problem is that more than one sensor data acquisition step might be
necessary to resolve all ambiguity. An example is presented later on in the experimental
section in the area of object recognition.

One way to deal with these problems is to form a sequential decision process and
to take into account the information acquired so far when selecting the next action.
The sequential decision process consists of two steps: the selection of the best action
a_t based on the maximum of the mutual information (MMI) and the application of the
Bayes rule to compute the a posteriori probability once the observation has been made.
The posterior is then fed back and used as prior for the next time step. This is justified by
the fact that the posterior contains all information acquired so far, i.e. sensor data fusion
is implicitly done during this step. In Fig. 2 the whole sequential decision process is
depicted.
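
The two alternating steps can be sketched as a short loop. This is an illustrative sketch for discrete states, actions, and observations only; the function names, the three-state example, and its deterministic likelihood tables are assumptions chosen so that no single action can resolve the ambiguity.

```python
import numpy as np

def mutual_information(prior, lik):
    """I(x; o | a) in nats; prior has shape (S,), lik has shape (S, O)."""
    joint = prior[:, None] * lik
    evidence = joint.sum(axis=0)
    indep = prior[:, None] * evidence[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

def bayes_update(prior, lik, o):
    """Posterior p(x | o, a) from the prior and the likelihood column of observation o."""
    posterior = prior * lik[:, o]
    return posterior / posterior.sum()

def sequential_decision(prior, likelihoods, true_state, steps, rng):
    """Alternate MMI action selection and Bayesian update; the posterior becomes the next prior."""
    belief = np.asarray(prior, dtype=float)
    actions = []
    for _ in range(steps):
        mi = [mutual_information(belief, lik) for lik in likelihoods]
        a = int(np.argmax(mi))                                  # MMI step
        o = rng.choice(likelihoods.shape[2], p=likelihoods[a, true_state])  # simulated sensor
        belief = bayes_update(belief, likelihoods[a], o)        # Bayes step
        actions.append(a)
    return belief, actions

# Hypothetical example: action 0 separates state 0 from {1, 2},
# action 1 separates state 2 from {0, 1}.
likelihoods = np.array([
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
])
prior = np.full(3, 1.0 / 3.0)
belief, actions = sequential_decision(prior, likelihoods, true_state=1, steps=2,
                                      rng=np.random.default_rng(0))
print(actions, belief)
```

In this toy setting state 1 is identified only after both actions have been used, because the action that was informative under the uniform prior has zero MI under the updated prior.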

By definition the iterative decision process is optimal since each step is optimal with
respect to the prior of the state x_t. Since the posterior is used as prior for the next time
step we assure that the next action is selected considering the knowledge acquired so
far. More important is the fact that this sequential decision process converges, i.e. the
pdf p(x) over the state space will converge towards a certain distribution. Only a sketch
of the proof is given in the following.

The key point of the convergence proof is that an irreducible Markov chain can be
defined representing the sequential decision process [?]. Two corollaries give us the
proof of convergence. The first one is that the Kullback-Leibler distance between two
distributions on a Markov chain will never increase over time. The second one is that the
Kullback-Leibler distance between a distribution on a Markov chain and a stationary
distribution on a Markov chain decreases over time. If there is more than one stationary
distribution, the process converges to the distribution with minimum Kullback-Leibler
distance to all stationary distributions. Since each irreducible Markov chain has
at least one stationary distribution, we end up with a convergence toward a certain
distribution over the Markov chain. This distribution is difficult to compute. But by this
result we know that the sequential decision process will converge. This convergence is
important for practical considerations, i.e. when to stop the whole process.
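
The first corollary can be checked numerically on a generic finite Markov chain. The sketch below is not the chain constructed in the proof; it assumes an arbitrary, randomly generated row-stochastic matrix with strictly positive entries and two arbitrary start distributions, and only illustrates that the Kullback-Leibler distance between them does not grow as both evolve under the same chain.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler distance D(p || q) in nats for discrete distributions with q > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(1)

# Hypothetical transition matrix: strictly positive entries, rows normalized to sum to one.
P = rng.random((4, 4)) + 0.05
P /= P.sum(axis=1, keepdims=True)

mu = rng.dirichlet(np.ones(4))   # two arbitrary start distributions on the chain
nu = rng.dirichlet(np.ones(4))

dists = []
for _ in range(20):
    dists.append(kl(mu, nu))
    mu, nu = mu @ P, nu @ P      # one time step for both distributions

print([round(d, 6) for d in dists[:5]])
```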

In practice this convergence can also be observed. In many of our experiments in the
area of active object recognition the distribution converges to the optimum distribution
with respect to minimum entropy. Note that it depends on the accuracy of the likelihood
functions whether the resulting distribution will identify the right state. If the likelihood
function, i.e. the relationship between state and observation, is erroneous, the sequential
decision process cannot improve state estimation at all. On the one hand this might be
seen as a drawback, since the state estimator is not optimized but only the sensor data
provided for state estimation. On the other hand it is a big advantage, since any Bayesian
state estimator at hand can be combined with the proposed sequential decision process.
The more ambiguous the observations are, the more the state estimation will be improved
by our method.

Due to lack of space we have restricted ourselves here to describing the main
principles. A more detailed discussion of the underlying information theoretic concepts
as well as of the evaluation of the differential mutual information by Monte Carlo
techniques can be found in [?]. There the reader will also find error bounds for the
estimation of the mutual information.

4 Active Object Recognition Using Viewpoint Selection 

To apply our proposed method we have chosen an object recognition scenario. We have
selected a statistical eigenspace approach which has been introduced in [?]. Here we
apply it as the state estimator for classification.

The key idea is that the projection c of an image f into the eigenspace of
a class Ω_κ is assumed to be normally distributed, i.e. p(c | f, Ω_κ) ~ N(μ_κ, Σ_κ).
Classification is then done not by computing the minimum distance in eigenspace between
a projected test image f and the manifold of a certain class [?], but by maximizing the a
posteriori probability, which is proportional to p(c | f, Ω_κ) p(Ω_κ). As a consequence the
prior can be explicitly taken into account, and one obtains not only the best class
hypothesis but also a statistical measure for the match. For viewpoint selection the
likelihood functions p(c | f, a, Ω_κ) for each viewpoint a have to be estimated during
training. In our case a maximum likelihood estimation of the parameters of the Gaussian is
performed. Due to lack of space only a coarse summary of the method could be given. More
details are found in [?].
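
A minimal sketch of the classification step may help. It assumes that feature vectors c are already available (the eigenspace projection itself is not shown), uses synthetic two-dimensional training data invented for this example, and introduces the hypothetical helper names `fit_gaussian`, `log_gaussian`, and `classify`.

```python
import numpy as np

def fit_gaussian(samples):
    """Maximum likelihood estimate of mean and covariance (normalization by N, not N-1)."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False, bias=True)
    return mu, cov

def log_gaussian(c, mu, cov):
    """Log density of a multivariate normal distribution evaluated at c."""
    diff = c - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mu) * np.log(2.0 * np.pi) + logdet + maha)

def classify(c, params, priors):
    """Return the class maximizing likelihood times prior, and the normalized posterior."""
    log_post = np.array([log_gaussian(c, mu, cov) + np.log(p)
                         for (mu, cov), p in zip(params, priors)])
    log_post -= log_post.max()            # subtract the maximum for numerical stability
    post = np.exp(log_post)
    post /= post.sum()
    return int(np.argmax(post)), post

# Synthetic training features for two hypothetical classes.
rng = np.random.default_rng(0)
train = [rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
         rng.normal([4.0, 4.0], 1.0, size=(200, 2))]
params = [fit_gaussian(s) for s in train]
priors = np.array([0.5, 0.5])

label, post = classify(np.array([3.9, 4.2]), params, priors)
print(label, post)
```

Because the posterior is returned alongside the label, it can be fed back as the prior of the next recognition step.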



5 Experiments and Results 

Five toy manikins form the data set (cf. Fig. 3). There are only certain views from which
the objects can be distinguished. The main differences in the objects are the small items
that the manikins carry (band, lamp, quiver, gun, trumpet).

The experimental setup consists of a turntable and a robot arm with a camera mounted
that can move around the turntable. The actions a = (φ, θ)^T define the position of the
camera on the hemisphere around the object. The statistical eigenspace approach is used
as classifier. The state x is the class of the object.

We compared our viewpoint planning with a random strategy for viewpoint selection.
Table 1 summarizes the results. The planning based on maximum mutual information
outperforms a random strategy in both recognition rate and number of views necessary
for classification. In most cases the object can be recognized within three views at the
Fig. 3. The first view is ambiguous with respect to the objects in image two and three. The second
and third view allow for a distinction of objects two and three but not to distinguish object one
from four (the objects with and without quiver on the back). Similar arguments hold for the two
objects shown in the last three images.



latest. Also, the maximum a posteriori probability after the decision for one class is larger
in the mean for the viewpoint planning, indicating more confidence in the final decision
(for example, object trumpet: 0.97 vs. 0.65). In contrast to other viewpoint selection
approaches, for example based on reinforcement learning [?], we do not need to train
the optimal sequence. All necessary information is already encoded in the likelihood
functions, which are provided by the Bayesian classifier.



Table 1. Results for viewpoint planning and random viewpoint control (100 trials per object): 
Recognition rate, mean number of views, and the mean of the maximum a posteriori probability 
for the right class after the decision. 





          planned viewpoint control              random viewpoint control
object    rec. rate  mean no.  mean max. a       rec. rate  mean no.  mean max. a
          (%)        views     poster. prob.     (%)        views     poster. prob.
band      86         1.13      0.98              77         4.28      0.95
lamp      97         1.14      0.98              93         4.94      0.96
quiver    99         1.05      0.99              95         3.09      0.97
gun       90         2.19      0.97              80         8.96      0.69
trumpet   99         2.29      0.97              70         8.89      0.65
average   94.2       1.56      0.97              83.0       6.03      0.84



In Fig. 4 (left) the MI is shown at the beginning of the sequential decision process,
i.e. the prior is assumed to be uniform. The x- and y-axis are the motor-steps for moving
the turntable and the robot arm, to select positions of the camera on the hemisphere. The
motor-step values correspond to a rotation between 0 and 360 degrees for the turntable
and -90 to 90 degrees for the robot arm. The MI is computed by Monte Carlo simulation
(see [?]). The maximum of this 2D function in the case of viewpoint selection defines the best
action (viewpoint) to be chosen. In Fig. 4 (right) the corresponding view of the object
is shown (for one of the objects as an example). This viewpoint is plausible, since the
presence of the quiver as well as the lamp can be determined, so that three of the five
objects can already be distinguished.
Fig. 4. Left: MI in the viewpoint selection example assuming a uniform prior (computed by Monte
Carlo evaluation). The x and y are the motor-steps for the turntable and robot arm, respectively.
Along the z axis the MI is plotted. Right: best view a decided by the maximum in the MI
(a = (2550, 1500)). As example, object band is shown.



In general the computation time depends linearly on the number of actions and the
number of classes. In practice, for viewpoint selection less than one second is needed on
a Pentium II/300 for the computation of the best action using 1000 samples, 5 classes
and a total of 3360 different actions (positions on the hemisphere).
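
The Monte Carlo evaluation of the MI for a discrete class variable and a continuous observation can be sketched as follows. This is an illustrative estimator under simplifying assumptions, not the implementation used in the experiments: the observation is one-dimensional, each class has a Gaussian likelihood, and the names `log_normal` and `mc_mutual_information` as well as the two example actions are hypothetical.

```python
import numpy as np

def log_normal(o, mu, sigma):
    """Log density of a one-dimensional normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((o - mu) / sigma) ** 2

def mc_mutual_information(prior, mus, sigmas, n_samples, rng):
    """Monte Carlo estimate of I(x; o | a) in nats for one action.

    For each class k, observations are sampled from p(o | k, a) and the
    log ratio log p(o | k, a) - log p(o | a) is averaged.
    """
    mi = 0.0
    for k, pk in enumerate(prior):
        if pk == 0.0:
            continue
        o = rng.normal(mus[k], sigmas[k], size=n_samples)
        log_lik = log_normal(o[:, None], mus[None, :], sigmas[None, :])  # shape (N, K)
        log_evidence = np.log(np.exp(log_lik) @ prior)                   # log p(o | a)
        mi += pk * np.mean(log_lik[:, k] - log_evidence)
    return float(mi)

rng = np.random.default_rng(0)
prior = np.full(3, 1.0 / 3.0)
sigmas = np.ones(3)

# Two hypothetical viewpoints: one where all classes look alike, one where they differ clearly.
mi_ambiguous = mc_mutual_information(prior, np.zeros(3), sigmas, 1000, rng)
mi_distinct = mc_mutual_information(prior, np.array([0.0, 10.0, 20.0]), sigmas, 1000, rng)
print(mi_ambiguous, mi_distinct)
```

For the ambiguous viewpoint the estimate is zero up to rounding, while for well separated classes it approaches log(3), the entropy of the uniform prior over three classes.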



6 Conclusion 



We have presented a general framework for sensor data selection in state estimation. The
approach has been applied to the optimal selection of camera parameters (viewpoint) in
active object recognition. It is worth noting that the approach is not restricted to camera
parameter selection but can be applied in any situation where the sensor acquisition
process can be influenced. One example is gaze control, where the pan/tilt/zoom
parameters of a camera are changed [?]. Another example might be the adaptive change of
illumination to enhance relevant features.

The approach presented in this paper is independent of the state estimator at hand.
The only requirement is that the state estimator must provide likelihood functions for the
observation given the state. The big advantage is that any state estimator can
be improved by our method as long as the state estimator does not return systematically
wrong results.

Compared to previously published work our approach forms a sequential decision
process and its convergence can be proven. In contrast to reinforcement learning
approaches [?] for active object recognition we do not need to train the optimal sequence.
Thus, the typical tradeoff between exploitation and exploration in reinforcement learning
does not exist for our framework. All relevant information necessary to decide for
an optimal action is already encoded in the likelihood functions and the prior. The prior
is computed step by step during the recognition process and the likelihood functions are
assumed to be provided by the state estimator. Experiments showed that the framework
works in an object recognition scenario with a state of the art classifier and outperforms
a random strategy.

In our current work we are extending this approach to state estimation of dynamic systems,
and we will modify the algorithms so that continuous action spaces can also be
handled.
Abstract. From the task of automatically reconstructing real world scenes using
range images, the problem of planning the image acquisition arises. Although 
solutions for small objects in known environments are already available, these 
approaches lack scalability to large scenes and to a high number of degrees of 
freedom. In this paper, we present a new planning algorithm for initially unknown, 
large indoor environments. Using a surface representation of seen and unseen parts 
of the environment, we propose a method based on the analysis of occlusions. 
In addition to previous approaches, we take into account both a quality criterion 
and the cost of the next acquisition. Results are shown for two large indoor 
scenes - an artificial scene and a real world room - with numerous self occlusions. 

Keywords: view planning, active vision, range image fusion, 3d reconstruction, 
modelling from reality, autonomous exploration. 



1 Introduction 

In recent years a number of approaches to the reconstruction of real world scenes from 
range images have been proposed. However, the acquisition of the required range images 
is a time-consuming process which human operators usually cannot perform efficiently. 
Alternatively, the acquisition step can be automated. While the necessary robots are 
available for a range from small objects to large scale environments, there is no suffi- 
ciently scalable algorithm to determine the acquisition parameters of the range images. 
Thus, an appropriate view planning technique is crucial e.g. for the fast and cost effec- 
tive generation of virtual reality models, as-built analysis of architectural and industrial 
environments, and remote inspection of hazardous environments. 

In this paper, we address the problem of view planning with an initially unknown 
geometry of the scene and a given set of parameters for the first view. The parameter 
sets for the subsequent views are planned iteratively within the acquisition cycle, which 
can be outlined in four steps:

1. Image acquisition.

2. Integration of the newly acquired image into the reconstructed model using an
appropriate representation of unseen parts of the object.

3. View planning: determination of the parameters for the next acquisition, or
termination if the result is not improved by further acquisitions.

4. Moving the acquisition system to the next capture point.

* The work presented here was done while Konrad Klein was with the EC-Joint Research Centre,
funded by the European Commission’s TMR network “CAMERA”, contract number ERBFMRXCT970127.

The majority of approaches previously proposed is based on the analysis of the
boundary surface of the visible volume. Conolly [?] proposes an octree-based ray-casting
approach of which the rationale is a discretisation of the problem using a spherical
discretisation map for the searched viewing parameters, and an occupancy-grid
representation for the object. Maver and Bajcsy [?] aim to completely acquire an object’s surface
from one single direction using a camera-laser triangulation system based on the analysis
of occlusions. This approach relies on triangulation based scanners and is thus unsuitable
for the use in large scenes. Pito [5] proposes an approach based on the analysis of both
the measured and the “void surface”, which bounds unseen parts of the visible volume.
The optimisation of the parameter set for the next view is performed in two steps, of
which the first reduces the search space by computing possible optimal positions without
taking occlusions into account. Some authors [?] consider the occlusion surfaces
discretely, thereby avoiding an exhaustive evaluation of the search space. These approaches
are most suitable when a pre-existing CAD model can be used in the planning step, while
the high number of occlusion surfaces in noisy real world data renders these approaches
unusable.

This paper presents a novel method for view planning in large, complex indoor
environments with an acquisition system providing eight degrees of freedom (3D position,
viewing direction, field of view, and resolution). The core component of the method is
our objective function, which quantifies the cost-benefit ratio of an acquisition.
This function is passed to a global optimisation method in order to determine the
set of parameters for the following image acquisition. The aim of the overall planning
algorithm is a complete surface model measured with a predefined sampling density in
each point.



2 View Planning 

The parameters to be determined for each view mainly depend on the acquisition system.
A typical range acquisition system is based on a laser range finder mounted on a pan-
and-tilt unit with variable field of view and resolution, which can be freely positioned
using a tripod or a mobile robot (Fig. 1). Consequently, eight parameters have to be
determined, which are the 3D position, the viewing direction (two angles), the horizontal
and vertical field of view, and the resolution.

When determining the set of parameters of the next view, the following partly
competing goals are pursued:

- Maximise the visible volume by resolving occluded volumes. 

- Maximise the area of the surface measured with sufficient sampling density. 

- Minimise the resources needed to reconstruct the scene with a predefined quality. 
Fig. 1. Tripod and mobile acquisition robot (Autonomous Environmental Sensor for Telepresence
AEST [?]).



These goals are incorporated in an objective function G which evaluates a parameter
set for a given scene description based on the previously taken images. The view
planning problem is then addressed by solving a constrained continuous global optimisation
problem for the search space D ⊂ ℝ^8 limited by the constraints formulated in Section
2.3. Besides the first range image, the algorithm requires the desired sampling density
as an input parameter.

2.1 Iterative Surface Construction 

Following Pito [?] we use a binary partition of space into the visible volume, which is
located between any of the capture positions and the surface measured from that position,
and the invisible volume, which is unseen in the images already acquired because it is
located outside all fields of view or because it is occluded by measured surfaces. The
invisible volume consists of the object volume and additional void volume. Accordingly,
the surface S of the visible volume is partitioned into a measured part and the unmeasured
void surface (Fig. 2).

This partition is described for acquisition step i by the volume V_i visible in the
acquired image, the measured surface m_i, and the unmeasured (void) surface u_i of the
visible volume, so that the surfaces m_i and u_i together completely enclose the volume
V_i. For the integration of multiple range images into a single surface description, we
denote the surface already measured in steps 1 ... i by M_i, and the void surface that has
not been resolved in steps 1 ... i by U_i. While these surfaces are initially M_1 := m_1 and
U_1 := u_1, we iteratively construct the new representation in each step by

Ui+i := Ui n Vi+i U Ui+i n (mi U . . . U Mi) 

M_{i+1} := M_i ∪ m_{i+1}    (1)

so that the results of the previous iteration of the acquisition cycle can be re-used. Both 
the measured surface Mi and the void surface Ui are represented by a triangle mesh. 
The union of two meshes is implemented using a mesh zippering approach, while the 
intersection between a mesh and a volume is implemented by clipping the triangles 
against the volume. Standard methods of mesh simplification [?] can be used to reduce
the number of triangles. 



2.2 Objective Function 

When planning views of a partly unknown environment, one has to assess possible
views such that the unknown environment is captured within the planned sequence of 
views. It is reasonable to achieve this by additional measurements towards a void surface 
because the void volume decreases and the area of measured surface increases in this 
manner. However, the effect cannot be quantitatively predicted. Instead, we treat the 
void surface like an unmeasured object surface and plan the next view accordingly. 

When evaluating a set of parameters, we take into account the field of view, the
viewing direction, and the resolution of the next view. These values are expressed by
the variables ϑ_h and ϑ_v for the horizontal and vertical field of view, φ_a and φ_e for
the azimuth and elevation angle of the sensor, and m for the resolution as a factor in
reference to a standard resolution. Using these definitions, the visibility of the point p
from point x is expressed by the weight w(p, x, φ_a, φ_e, ϑ_h, ϑ_v), which is 1 if the point
p is visible in the evaluated view and 0 otherwise.

For the incorporation of a quality criterion we define a function β : S → ℝ which
yields the sampling density for a given point on the surface S. Points on void surfaces
yield a function value of 0. Additionally, we associate each point p on S with a desired
sampling quality β_max(p).

Accordingly, we define a function F : S × ℝ^8 → ℝ, which expresses the expected
sampling density yielded at a point on the surface S when viewed with the evaluated
set of parameters. With the solid angle A_patch covered by one pixel at the position in the
image grid in standard resolution* onto which the point p ∈ S is projected, the function

* The size of one pixel on the unit sphere (the solid angle) can be computed for every grid position
(u, v) in the field of view and for every resolution m > 0 using the Jacobian matrix of
the mapping from the image grid to Cartesian coordinates. We assume the size of a pixel to be
linear in the resolution factor m.

F is formulated by




( 2 ) 



where n is the normal of the surface S at point p and d = |p - x| is the distance between
the capture point x and p.

In order to avoid a high number of small images or a high number of data points,
the costs of an image acquisition with the evaluated set of parameters have to be taken
into account as well. To meet these demands, we express the objective function as a
cost-benefit ratio which has to be minimised, thereby taking the number of pixels N
and the costs c_a of each acquisition into account. The benefit depends on the parts of
the scene actually seen from point x and is thus integrated over the visible area of the
surface S. Summarising the geometric parameter vector by (x, φ_a, φ_e, ϑ_h, ϑ_v) =: τ, the
objective function can finally be expressed by the cost-benefit ratio



where the value of c_a has to be chosen empirically and is in the order of magnitude
of the number of pixels in a typical image. Considering the predefined desired quality
β_max(p) ensures that redundant measurements do not increase the computed benefit.

2.3 Implementation 

We compute the objective function G efficiently using an image-based
rendering approach where the results of the expensive visibility determination can be
re-used for multiple evaluations of the function with a constant viewpoint x [?].
Consequently, an appropriate method of global optimisation is required which allows for an
efficient order of evaluated parameter sets and is robust in the presence of noise caused
by the computation in image precision. We meet these demands by a simple uniform
grid search algorithm.
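
A uniform grid search of this kind can be sketched in a few lines. The sketch below is generic and makes several assumptions: the function `grid_search`, its arguments, and the two-dimensional toy objective are hypothetical stand-ins, and the actual objective G and its eight-dimensional parameter space are not reproduced here.

```python
import itertools
import numpy as np

def grid_search(objective, bounds, steps, feasible=lambda p: True):
    """Minimise `objective` over a uniform grid.

    bounds:   list of (low, high) pairs, one per parameter.
    steps:    number of grid points per parameter.
    feasible: predicate implementing the constraints; infeasible points are skipped.
    """
    axes = [np.linspace(lo, hi, n) for (lo, hi), n in zip(bounds, steps)]
    best_point, best_value = None, float("inf")
    # itertools.product varies the last axis fastest, so if the viewpoint
    # coordinates come first, consecutive evaluations share the same viewpoint.
    for point in itertools.product(*axes):
        if not feasible(point):
            continue
        value = objective(point)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value

# Toy objective standing in for the cost-benefit ratio.
toy = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
bounds = [(-3.0, 3.0), (-3.0, 3.0)]

free_point, free_value = grid_search(toy, bounds, steps=[13, 13])
con_point, con_value = grid_search(toy, bounds, steps=[13, 13],
                                   feasible=lambda p: p[0] <= 0.0)
print(free_point, free_value, con_point, con_value)
```

Ordering the parameters so that the viewpoint changes slowest is one way to let expensive per-viewpoint results be computed once and reused for all inner evaluations.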

During the global optimisation, two different constraints have to be taken into ac- 
count, of which the first allows for the operational range of the acquisition device and 
the necessary distance to obstacles. The second constraint is an overlap criterion which 
ensures the registration of the newly acquired image with the model reconstructed so 
far. Our experiments showed an overlapping share of 20% of the pixels to be sufficient. 

When automatically reconstructing a scene, the acquisition cycle requires a
termination criterion. We use the improvement of the sampling density with respect
to the desired sampling density: if the area of the surface reconstructed with the desired
sampling density does not increase by an acquisition step, the algorithm terminates.
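
The acquisition cycle with this termination criterion can be sketched as a simple loop. Everything below is a toy illustration under stated assumptions: the callables `acquire`, `integrate`, `plan_next_view`, and `covered_area` are hypothetical placeholders for the real sensor, mesh integration, view planner, and quality measure, and the one-dimensional "scene" of ten cells is invented for the example.

```python
def acquisition_cycle(acquire, integrate, plan_next_view, covered_area,
                      first_view, max_steps=100):
    """Acquire, integrate, and plan until a step no longer increases the covered area."""
    model, area, view, views = None, 0.0, first_view, []
    for _ in range(max_steps):
        image = acquire(view)              # step 1 (and 4): capture at the current view
        model = integrate(model, image)    # step 2: merge into the model
        views.append(view)
        new_area = covered_area(model)
        if new_area <= area:               # termination: no improvement by this step
            break
        area = new_area
        view = plan_next_view(model)       # step 3: choose the next view
    return model, views

# Toy stand-ins: the scene is a row of ten cells, a view sees three adjacent cells.
SCENE = set(range(10))

def see(view):
    return {c for c in (view, view + 1, view + 2) if c in SCENE}

def merge(model, image):
    return set(image) if model is None else model | image

def plan(model):
    # Greedy planner: pick the view that reveals the most unseen cells.
    return max(range(10), key=lambda v: len(see(v) - model))

model, views = acquisition_cycle(see, merge, plan, len, first_view=0)
print(views, sorted(model))
```

The last view in the returned list is the one that added nothing, which is what triggers termination in this sketch.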



G(r, m) : 



m ■ N + Ca 




3 Results 



When evaluating the results of the planning technique, two main criteria can be dis- 
tinguished, of which the first is the percentage of the scene volume reconstructed at a 
given iteration. As the presence of obstacles in an invisible volume cannot be decided on,
the percentage of the reconstructed volume is crucial for automatic path planning with 
an autonomous mobile acquisition robot. The second criterion is the amount of surface 
area represented at a given resolution. We evaluate the percentage of the surface area 
represented with the desired sampling density as well as the percentage represented with 
lower densities (75%, 50%, 25% of the desired resolution, and the percentage visible 
at all). By analysing both criteria as functions of the iteration number the quality of the 
reconstruction in each iteration can be estimated. 




Fig. 3. Rendered, backface-culled view of the artificial scene (top) and reconstruction result after 
10 automatically planned image acquisitions for a real world scene (bottom). 



We present two complex scenes to demonstrate the practical usability of the method.
The first test scene is an artificial model of 20 m x 7 m x 5.3 m shown in Fig. 3, which is
used with an acquisition simulator. Surfaces which cannot be measured by the acquisition
system are simulated by three holes in the model, so that these areas remain undefined
in all range images.

The sensor geometry and the constraints match the data for our laser scanner
(approximately polar coordinate system from a pan-and-tilt unit): height between 0.9 m
and 2.0 m, viewing direction 0° to 360° (azimuth angle) and -60° to 60° (elevation
angle), horizontal field of view 38° to 278°, vertical field of view 30° to 60°. The
maximal angular resolution is 0.1° per pixel in both directions, while the desired sampling
density β_max is 1600 samples per m^2 (corresponding to a 2.5 cm grid). Fig. 4 shows the
proposed criteria for a sequence of 20 views. While the reconstructed volume quickly
reaches 99% of the final volume (iteration 12), the desired sampling density requires
significantly more iterations: although 95% of the surface area is depicted in one of
the 20 images, only 37.4% is represented with the desired sampling density. This is
mainly due to the complicated geometry which requires a high number of views from a
close distance.







Fig. 4. Results for the real world scene (left) and the artificial scene (right): Percentage of recon- 
structed volume (top) and percentage of surface area represented with given sampling densities 
(bottom) for each iteration. 



The second test scene is a typical complex laboratory environment (10 m x 6 m
x 4.3 m), which serves as a complicated real world scene. All technical parameters are
identical to those from the experiment with the artificial scene, except for the desired
sampling density of 625 samples per m^2 (corresponding to a 4 cm grid), and the maximum
angular resolution of 0.5° per pixel (both directions). Fig. 4 shows the evaluation data
from 10 views. While some occlusions have not been resolved, the algorithm yields a
sufficient density in more than 70% of the reconstructed surface area. This demonstrates
the efficiency of the method with limited resources. Failures in the registration due to
an insufficient overlap did not occur in our experiments.
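
The correspondence between the desired sampling densities and the quoted grid spacings can be verified directly. The relation below assumes a square sampling grid with one sample per cell; the symbols ρ (density) and g (grid spacing) are introduced here for illustration only.

```latex
\rho = \frac{1}{g^2}, \qquad
g = 0.025\,\mathrm{m} \;\Rightarrow\; \rho = \frac{1}{(0.025\,\mathrm{m})^2} = 1600\,\mathrm{m}^{-2}, \qquad
g = 0.04\,\mathrm{m} \;\Rightarrow\; \rho = \frac{1}{(0.04\,\mathrm{m})^2} = 625\,\mathrm{m}^{-2}.
```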



4 Conclusion and Future Work 

We present a technique of planning the range image acquisition for the reconstruction 
of large, complex indoor scenes which incorporates both cost and benefit of the image 
acquisition. The practical usability is demonstrated in the automated reconstruction of
two large scenes. Future work will address the use of information from previous iterations 
of the acquisition cycle. 
Abstract. With the services that autonomous robots are to provide
becoming more demanding, the states that the robots have to estimate 
become more complex. In this paper, we develop and analyze a proba- 
bilistic, vision-based state estimation method for individual, autonomous 
robots. This method enables a team of mobile robots to estimate their 
joint positions in a known environment and track the positions of au- 
tonomously moving objects. The state estimators of different robots co- 
operate to increase the accuracy and reliability of the estimation process. 

This cooperation between the robots enables them to track temporarily
occluded objects and to recover their position faster after they have lost
track of it. The method is empirically validated based on experiments
with a team of physical robots.



1 Introduction 

Autonomous robots must have information about themselves and their environ- 
ments that is sufficient and accurate enough for the robots to complete their 
tasks competently. Contrary to these needs, the information that robots receive 
through their sensors is inherently uncertain: typically the robots’ sensors can 
only access parts of their environments and their sensor measurements are inac- 
curate and noisy. In addition, control over their actuators is also inaccurate and 
unreliable. Finally, the dynamics of many environments cannot be accurately 
modeled and sometimes environments change nondeterministically. 

Recent long-term experiments with autonomous robots [?] have shown that
an impressively high level of reliability and autonomy can be reached by
explicitly representing and maintaining the uncertainty inherent in the available
information. One particularly promising method for accomplishing this is
probabilistic state estimation. Probabilistic state estimation modules maintain the
probability densities for the states of objects over time. The probability density
of an object’s state conditioned on the sensor measurements received so far
contains all the information about the object that is available to a
robot. Based on these densities, robots are not only able to determine the most
likely state of the objects, but can also derive even more meaningful statistics
such as the variance of the current estimate.

Successful state estimation systems have been implemented for a variety of 
tasks including the estimation of the robot’s position in a known environment, 
the automatic learning of environment maps, the state estimation for objects 
with dynamic states (such as doors), the tracking of people’s locations, and
gesture recognition.

With the services that autonomous robots are to provide becoming more 
demanding, the states that the robots have to estimate become more complex. 
Robotic soccer provides a good case in point. In robot soccer (mid-size league) 
two teams of four autonomous robots play soccer against each other. A proba- 
bilistic state estimator for competent robotic soccer players should provide the 

B. Radig and S. Florczyk (Eds.): DAGM 2001, LNCS 2191, pp. 321-^^^ 2001. 
action selection routines with estimates of the positions and maybe even the
dynamic states of each player and the ball.

This estimation problem confronts probabilistic state estimation methods 
with a unique combination of difficult challenges. The state is to be estimated 
by multiple mobile sensors with uncertain positions, the soccer field is only partly 
accessible for each sensor due to occlusion caused by other robots, the robots 
change their direction and speed very abruptly, and the models of the dynamic 
states of the robots of the other team are very crude and uncertain. 

In this paper, we describe a state estimation module for individual, au- 
tonomous robots that enables a team of robots to estimate their joint positions 
in a known environment and track the positions of autonomously moving ob- 
jects. The state estimation modules of different robots cooperate to increase the 
accuracy and reliability of the estimation process. In particular, the cooperation 
between the robots enables them to track temporarily occluded objects and to
recover their positions faster after they have lost track of them.

The state estimation module of a single robot is decomposed into subcom- 
ponents for self-localization and for tracking different kinds of objects. This 
decomposition reduces the overall complexity of the state estimation process 
and enables the robots to exploit the structures and assumptions underlying the 
different subtasks of the complete estimation task. Accuracy and reliability are
further increased through the cooperation of these subcomponents. In this cooperation,
the estimated state of one subcomponent is used as evidence by the
other subcomponents. 

The main contributions of this paper are the following. First, we show
that image-based probabilistic estimation of complex environment states is fea- 
sible in real time even in complex and fast changing environments. Second, we 
show that maintaining trees of possible tracks is particularly useful for estimating 
a global state based on multiple mobile sensors with position uncertainty. Third, 
we show how the state estimation modules of individual robots can cooperate in 
order to produce more accurate and reliable state estimation. 

In the remainder of the paper we proceed as follows. Section 2 describes the
software architecture of the state estimation module and sketches the interactions
among its components. Section 3 provides a detailed description of the individual
state estimators. We conclude with our experimental results and a discussion of 
related work. 

2 Overview of the State Estimator 

Fig. 1a shows the components of the state estimator and its embedding into the
control system. The subsystem consists of the perception subsystem, the state 
estimator itself, and the world model. The perception subsystem itself consists of 
a camera system with several feature detectors and a communication link that 
enables the robot to receive information from other robots. The world model 
contains a position estimate for each dynamic task-relevant object. In this paper 
the notion of position refers to the x- and y-coordinates of the objects and 
includes for the robots of the own team the robot’s orientation. The estimated 
positions are also associated with a measure of accuracy, a covariance matrix. 

The perception subsystem provides the following kinds of information:
(1) partial state estimates broadcast by other robots, (2) feature maps extracted
from captured images, and (3) odometric information. The estimates
broadcast by the robots of the own team comprise the estimate of the ball’s
location. In addition, each robot of the own team provides an estimate of its
Fig. 1. (a) Architecture of the state estimator, (b) The figure shows an image captured
by the robot and the feature maps that are computed for self, ball, and opponent
localization.

own position. Finally, each robot provides an estimate for the position of ev- 
ery opponent. From the captured camera images the feature detectors extract 
problem-specific feature maps that correspond to (1) static objects in the envi- 
ronment including the goal, the borders of the field, and the lines on the field, 
(2) a color blob corresponding to the ball, and (3) the visual features of the oppo- 
nents. The state estimation subsystem consists of three interacting estimators: 
the self localization system, the ball estimator, and the opponents estimator. 
State estimation is an iterative process where each iteration is triggered by the
arrival of a new piece of evidence, i.e. a captured image or a state estimate
broadcast by another robot. The self localization estimates the probability density
of the robot’s own position based on extracted environment features, the estimated
ball position, and the predicted position. The ball localizer estimates the
probability density for the ball position given the robot’s own estimated position
and its perception of the ball, the predicted ball position, and the ball estimates
broadcast by the other robots. Finally, the positions of the opponents are
estimated based on the estimated position of the observing robot, the robots’
appearances in the captured images, and their positions as estimated by the
team mates. Every robot maintains its own global world model, which is constructed
as follows. The own position, the position of the ball, and the positions
of the opponent players are produced by the local state estimation processes.
The estimated positions of the team mates are the broadcast results of the
self localization processes of the corresponding team mates.

3 The Individual State Estimators 

3.1 Self- and Ball-Localization 

The self- and ball-localization module iteratively estimates, given the observa- 
tions taken by the robot and a model of the environment and the ball, the 
probability density over the possible robot and ball positions. A detailed description
and analysis of the applied algorithms can be found in [?] or in the
long version of this paper enclosed on the CD.

3.2 Opponents Localization 

The objective of the opponents localization module is to track the positions of the 
other team’s robots. The estimated position of one opponent is represented by 
Fig. 2. (a) The multiple hypotheses framework for dynamic environment modeling, (b)
An estimate of the robot’s distance is given through the intersection of the viewing ray 
with the ground plane of the field. 

one or more alternative object hypotheses. Thus the task of the state estimator is 
to (1) detect feature blobs in the captured image that correspond to an opponent, 
(2) estimate the position and uncertainties of the opponent in world coordinates, 
and (3) associate them with the correct object hypothesis. In our state estimator 
we use Reid’s Multiple Hypotheses Tracking (MHT) algorithm [?] as the basic
method for realizing the state estimation task. In this section we demonstrate 
how this framework can be applied to model dynamic environments in multi- 
robot systems. We extend the general framework in that we provide mechanisms 
to handle multiple mobile sensors with uncertain positions. 

Multiple Hypotheses Tracking. We will describe the Multiple Hypotheses
Tracking method by first detailing the underlying opponents model, then explaining
the representation of the position estimates of tracked opponents, and finally
presenting the computational steps of the algorithm.

The Opponents Model. The model considers opponent robots to be moving ob- 
jects of unknown shape with associated information describing their temporal 
dynamics, such as their velocity. The number of the opponent robots may vary. 
The opponent robots have visual features that can be detected as feature blobs 
by the perception system. 

The Representation of Opponent Tracks. When tracking the positions of a set of 
opponent robots there are two kinds of uncertainties that the state estimator has 
to deal with. The first one is the inaccuracy of the robot’s sensors. We represent 
this kind of uncertainty using a Gaussian probability density. The second kind of 
uncertainty is introduced by the data association problem, i.e. assigning feature 
blobs to object hypotheses. This uncertainty is represented by a hypotheses tree 
where nodes represent the association of a feature blob with an object hypothesis. 
A node $H_j(t)$ is a child of the node $H_i(t-1)$ if $H_j(t)$ results from the assignment
of an observed feature blob to a predicted state of the hypothesis $H_i(t-1)$.
In order to constrain the growth of the hypotheses tree, it is pruned to eliminate 
improbable branches with every iteration of the MHT. 

The MHT Algorithm. Fig. 2a outlines the computational structure of the MHT
algorithm. An iteration begins with the set of hypotheses $H(t-1)$ from the
previous iteration $t-1$. Each hypothesis represents a different assignment of
measurements to objects, which was performed in the past. The algorithm maintains
a Kalman filter for each hypothesis. For each hypothesis, the position $Z_i(t)$ of the
dynamic objects is predicted and compared with the next observation of an opponent
made by an arbitrary robot of the team. Assignments of measurements
to objects are accomplished on the basis of a statistical distance measure.
Each subsequent child hypothesis represents one possible interpretation of the 
set of observed objects and, together with its parent hypothesis, represents one 
The Unscented Transformation

The general problem is as follows. Given an n-dimensional vector random variable
$x$ with mean $\bar{x}$ and covariance $C_x$ we would like to estimate the mean $\bar{y}$ and the
covariance $C_y$ of an m-dimensional vector random variable $y$. Both variables are
related to each other by a non-linear transformation $y = g(x)$. The unscented
transformation is defined as follows:

1. Compute the set $Z$ of $2n$ points from the rows or columns of the matrices
$\pm\sqrt{n C_x}$. This set is zero mean with covariance $C_x$. The matrix square root
can efficiently be calculated by the Cholesky decomposition.

2. Compute a set of points $X$ with the same covariance, but with mean $\bar{x}$, by
translating each of the points as $x_i = z_i + \bar{x}$.

3. Transform each point $x_i \in X$ to the set $Y$ with $y_i = g(x_i)$.

4. Compute $\bar{y}$ and $C_y$ by computing the mean and covariance of the $2n$ points
in the set $Y$.



Fig. 3. Outline of the unscented transformation. 

possible interpretation of all past observations. With every iteration of the MHT,
probabilities describing the validity of a hypothesis are calculated [?]. In order
to constrain the growth of the tree the algorithm prunes improbable branches.
Pruning is based on a combination of ratio pruning, i.e. a simple lower limit
on the ratio of the probabilities of the current and best hypotheses, and the
N-scan-back algorithm [?]. The algorithm assumes that any ambiguity at time
$t$ is resolved by time $t+N$. Consequently, if at time $t$ hypothesis $H_i(t-1)$ has
$n$ children, the sum of the probabilities of the leaf nodes of each branch is calculated.
The branch with the greatest probability is retained and the others are
discarded.
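
The two pruning rules can be illustrated with a small sketch. It is a simplified Python illustration under stated assumptions, not the implementation used on the robots: the class `Hypothesis`, the parameter `min_ratio`, and the convention that `n_scan_back_prune` is called on the hypothesis that lies N scans in the past are all hypothetical, and hypothesis probabilities are taken as given.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """One node of the hypothesis tree (illustrative structure)."""
    prob: float                                   # probability of this hypothesis
    children: List["Hypothesis"] = field(default_factory=list)


def leaves(node: Hypothesis) -> List[Hypothesis]:
    """Collect all leaf hypotheses at or below `node`."""
    if not node.children:
        return [node]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


def ratio_prune(node: Hypothesis, min_ratio: float) -> None:
    """Remove leaves whose probability is below min_ratio times the best leaf."""
    best = max(leaf.prob for leaf in leaves(node))

    def keep(n: Hypothesis) -> bool:
        if not n.children:
            return n.prob >= min_ratio * best
        n.children = [c for c in n.children if keep(c)]
        return bool(n.children)       # drop inner nodes that lost all children

    keep(node)


def n_scan_back_prune(root: Hypothesis) -> None:
    """`root` is the hypothesis N scans back; keep only its most probable branch."""
    if len(root.children) <= 1:
        return
    # Score each branch by the summed probability of its leaf hypotheses.
    scores = [sum(leaf.prob for leaf in leaves(c)) for c in root.children]
    root.children = [root.children[scores.index(max(scores))]]
```

Ratio pruning removes individual unlikely leaves, whereas N-scan-back commits to a single branch at the older end of the tree, which bounds the depth over which ambiguity is carried along.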

Feature Extraction and Uncertainty Estimation. This section outlines the 
feature extraction process which is performed in order to estimate the positions 
and the covariances of the opponent team’s robots. Each opponent robot is
modeled in world coordinates by a bivariate Gaussian density with mean $\psi$ and
a covariance matrix $C_\psi$.

At present it is assumed that the opponent robots are constructed in the
same way and have approximately circular shape. All robots are colored black.
Friend-foe discrimination is enabled through predefined color markers (cyan and
magenta, see Fig. 1b) on the robots. Each marker color may be assigned to either
of the two competing teams. Consequently it is important that the following
algorithms can be parameterized accordingly. Furthermore we assume that the
tracked object almost touches the ground (see Fig. 2b). The predefined robot
colors allow a relatively simple feature extraction process.

Step 1: Extraction of Blobs Containing Opponent Robots. From a 
captured image the black color-regions are extracted through color classifica- 
tion and morphological operators. In order to be recognized as an opponent 
robot a black blob has to obey several constraints, e.g. a minimum size and a 
red or green color-region adjacent to the bottom region row. Through this we 
are able to distinguish robots from black logos and adverts affixed on the wall 
surrounding the field. Furthermore blobs that contain or have a color-region of 
the own team color in the immediate neighborhood are discarded. 
Fig. 4. (a) Intermediate mean $\omega$ and covariance $C_\omega$. (b) Propagation of uncertainties.

For all remaining regions three features are extracted: the bottommost pixel
row which exceeds a predefined length, the column col representing the center
of gravity, and a mean blob width in pixels. For the latter two features only
the three bottommost rows which exceed a certain length are used. In order
to determine these rows, we also allow for occlusion by the ball. If the
length of these rows exceeds an upper limit, we assume that we have detected
two opponents which are directly next to each other. In this case two centers of
gravity are computed and the width is halved.
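
The three blob features can be computed from a binary region mask roughly as follows. The sketch is an illustration in Python with NumPy and not the code running on the robots: the function name `blob_features`, the parameter `min_len`, and the use of the per-row pixel count as the row length are assumptions, and the handling of ball occlusion and of two adjacent opponents is omitted.

```python
import numpy as np


def blob_features(mask, min_len, n_rows=3):
    """Return (bottom_row, center_col, mean_width) of a binary blob, or None.

    mask    -- 2-D boolean array, True where a pixel belongs to the blob;
               larger row indices are assumed to be lower in the image
    min_len -- minimum number of blob pixels a row needs to be considered
    """
    mask = np.asarray(mask, dtype=bool)
    row_len = mask.sum(axis=1)                   # blob pixels per image row
    rows = np.flatnonzero(row_len >= min_len)    # rows that are long enough
    if rows.size == 0:
        return None
    bottom = rows[-n_rows:]                      # the bottommost qualifying rows
    cols = np.nonzero(mask[bottom])[1]           # column indices of their pixels
    return int(bottom[-1]), float(cols.mean()), float(row_len[bottom].mean())
```

Restricting the center of gravity and the width to the lowest rows keeps the estimate close to the point where the robot touches the ground, which is the part of the blob used for the distance estimate.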

In order to detect cascaded robots, i.e. opponent robots that are partially oc- 
cluded by other robots, our algorithm also examines the upper rows of the blobs. 
As soon as the length of a blob row differs significantly from the length of its 
lower predecessor and the respective world coordinates indicate a height of more 
than 10 cm above the ground we assume that we have detected cascaded robots. 
In this case we split the blob into two and apply the above procedure to both 
blobs. Empirically we have found that this feature extraction procedure is sufficient
to determine accurate positions of opponent robots. Mistakenly extracted
objects are generally resolved quickly by the MHT algorithm.

Step 2: Estimation of Opponent Position and Uncertainty. In the following
we will estimate the position and covariance of an observed robot. For
this the pose and the covariance of the observing robot as well as the position of the
detected feature blob in the image and the associated measurement uncertainties
are taken into account.

We define a function opp that determines the world coordinates of an opponent 
robot based on the pose $\Phi$ of the observing robot, the pixel coordinates row, col
of the center of gravity and the width width of the opponent robot’s blob. Due to 
rotations and radial distortions of the lenses opp is non-linear. First the function 
opp converts the blob’s pixel coordinates to relative polar coordinates. On this 
basis and the width of the observed blob the radius of the observed robot is 
estimated. Since the polar coordinates only describe the distance to the opponent 
robot but not the distance to its center, the radius is added to the distance. 
Finally the polar coordinates are transformed into world coordinates taking the 
observing robot’s pose $\Phi$ into account.
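
The last two steps of opp, adding the radius and transforming to world coordinates, can be sketched as follows. This is only an illustration: the function name `polar_to_world` and its argument layout are assumptions, and the camera-specific conversion from pixel coordinates to polar coordinates as well as the radius estimation from the blob width are not shown.

```python
import math
from typing import Tuple


def polar_to_world(pose: Tuple[float, float, float],
                   distance: float, bearing: float, radius: float) -> Tuple[float, float]:
    """Map a relative polar observation to the world position of the opponent's center.

    pose     -- (x, y, theta) of the observing robot in world coordinates
    distance -- range to the nearest point of the opponent (meters)
    bearing  -- direction of the opponent relative to the robot's heading (radians)
    radius   -- estimated radius of the opponent (meters)
    """
    x, y, theta = pose
    center_distance = distance + radius      # range to the center, not the surface
    angle = theta + bearing                  # viewing direction in the world frame
    return (x + center_distance * math.cos(angle),
            y + center_distance * math.sin(angle))
```

The trigonometric terms make the mapping non-linear in the robot's orientation, which is one reason why a linear propagation of the covariances is not adequate.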

In order to estimate the position $\psi$ and the covariance $C_\psi$ of an opponent robot,
we will use a technique similar to the unscented transformation [?] (see Fig. 3).
First an intermediate mean $\omega$ and covariance $C_\omega$ describing jointly the observing
robot’s pose and the observed robot is set up (see Fig. 4a). $\Phi$, row, col
and width are assumed to be uncorrelated with a variance of one pixel. To this
mean and covariance the unscented transformation using the non-linear mapping
opp is applied. This yields the opponent robot’s position $\psi$ and covariance
$C_\psi$. In Fig. 4b the uncertainties of objects depending on the uncertainty of the
Fig. 5. (a) Two robots are traveling across the field, while they observe three stationary
robots of the opponent team. The diamonds and crosses indicate the different measure- 
ments performed by the observing robots, (b) The resolved trajectory (continuous line) 
of an opponent robot, observed by two robots. The real trajectory is displayed as dotted 
line. The dashed lines indicate the robot’s 90 degrees field of view. 

observing robot and their relative distances are displayed using 1σ-contours. For
illustrative purposes the uncertainty ellipses are scaled by a factor of five. Each
robot observes two obstacles at distances of 3.5 and 7 meters. Robot Odilo is very
certain about its pose and thus the covariance of the observed robot depends
mainly on its distance. Robot Grimoald has a high uncertainty in its orientation
(≈ 7 degrees). Consequently the position estimate of the observed obstacle is
less precise and is highly influenced by the orientation uncertainty.
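
The four steps outlined in Fig. 3 translate almost directly into code. The following is a minimal NumPy sketch of that basic form of the unscented transformation, not of the authors' variant ("a technique similar to" it); the function name `unscented_transform` and its interface are assumptions made for illustration.

```python
import numpy as np


def unscented_transform(mean, cov, g):
    """Propagate (mean, cov) through the non-linear function g using 2n points."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.size
    # Step 1: columns of +/- sqrt(n * C_x) via Cholesky (L @ L.T == n * cov).
    L = np.linalg.cholesky(n * cov)
    Z = np.hstack([L, -L]).T                 # shape (2n, n), zero mean
    # Step 2: shift the points so that their mean is `mean`.
    X = Z + mean
    # Step 3: transform every point.
    Y = np.array([np.atleast_1d(g(x)) for x in X])
    # Step 4: mean and covariance of the 2n transformed points.
    y_mean = Y.mean(axis=0)
    diff = Y - y_mean
    y_cov = diff.T @ diff / (2 * n)
    return y_mean, y_cov
```

In the setting of Step 2, `mean` and `cov` would correspond to the intermediate mean and covariance that jointly describe the observing robot's pose and the blob measurement, and `g` to the mapping opp. For a linear `g` the result coincides with the exact propagated mean and covariance, which gives a simple sanity check.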

Step 3: Association of Opponents to Object Hypotheses. The association
of an opponent robot’s position with a predicted object position is currently
performed on the basis of the Mahalanobis distance. In the future we intend
to use the Bhattacharyya distance, which is a more accurate distance measure
for probability distributions.
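
A Mahalanobis-based association test can be sketched as follows. The code is illustrative only: the function names, the use of a combined covariance `S` of prediction and measurement, and the fixed gate constant are assumptions. The gate value is the 0.95 quantile of the chi-square distribution with two degrees of freedom, which matches the 0.95 level reported in the experiments for two-dimensional position measurements.

```python
import numpy as np

# 0.95 quantile of the chi-square distribution with 2 degrees of freedom,
# i.e. -2 * ln(0.05); used here as the gate for 2-D position measurements.
CHI2_GATE_2D_95 = 5.991


def mahalanobis_sq(z, z_pred, S):
    """Squared Mahalanobis distance between measurement z and prediction z_pred."""
    d = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)
    return float(d @ np.linalg.solve(np.asarray(S, dtype=float), d))


def in_gate(z, z_pred, S, gate=CHI2_GATE_2D_95):
    """True if z is statistically compatible with the predicted position."""
    return mahalanobis_sq(z, z_pred, S) <= gate
```

Only measurements that pass the gate need to be considered as candidate assignments for a hypothesis, which limits the number of child hypotheses generated per iteration.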

4 Experimental Results 

The presented algorithms are applied in our middle-size RoboCup team, the
AGILO RoboCuppers. At present, the RoboCup scenario defines a fixed world
model with field-boundaries, lines and circles (see Fig. 5). Our approach was
successfully applied in 1999 and 2000 during the RoboCup World Soccer Championship
in Stockholm and Melbourne and the German Vision Cup. During a
RoboCup match, every robot is able to process 15 to 18 frames per second with
its on-board Pentium 200 MHz computer. When the robots’ planning algorithms
are turned off the vision system is easily able to cope with the maximum frame
rate of our camera (25 fps). The localization algorithm runs with a mean processing
time of 18 msec for a 16-Bit RGB (384 * 288) image. Only for 4% of the
images the processing time exceeds 25 msec. A detailed analysis of the self- and
ball-localization algorithm can be found in [?].

In the following we will present experiments that investigate the capability of
our system to track multiple opponent robots. In the first experiment we have
examined the capability of our algorithms to detect and estimate the opponent
robots’ positions. The robots Odilo and Grimoald are simultaneously traveling in
opposite directions from one side of the playing field to the other (see Fig. 5a).
While they are in motion they are observing three stationary robots which are set
up at different positions in the middle of the field. Diamonds and crosses indicate
the opponents observed by Odilo and Grimoald, respectively. The variance in the

The name AGILO is an homage to the oldest ruling dynasty in Bavaria, the Agilolfinger. The
dynasty’s most famous representatives are Grimoald, Hugibert, Odilo and Tassilo.
observation is due to position estimations over long distances (4 to 7 meters) and
minor inaccuracies in the robots’ self-localization. Furthermore it is noteworthy
that the vision systems of both robots never mistook their teammate for an opponent.

The second experiment examines the tracking and data fusion capability of 
our system. Odilo and Grimoald are set up at different positions on the field. An 
opponent robot crosses the field diagonally from the corner of the penalty area 
at the top right to the corner of the penalty area at the lower left (see Fig. 5b).
The first part of the journey is only observed by Grimoald, the middle part by
both robots and the final part only by Odilo. The 90 degrees field of view of
both robots is indicated through lines. 

The opponent robot was tracked using the MHT algorithm with a simple
linear Kalman filter. The state transition matrix described a constant velocity
model and the measurement vector provided positional information only. The
positions of the opponent robots and their uncertainties were computed according
to the algorithm described in Section 3.2. Positional variance for the pixel
coordinates of the region’s center of gravity and region’s width was assumed
to be one pixel. The process noise was assumed to be white noise acceleration
with a variance of 0.1 meters. The Mahalanobis distance was chosen such that
$P\{X \le \chi^2\} = 0.95$. N-scan-back pruning was performed from a depth of $N = 3$.
In general the update time for one MHT iteration including N-scan-back and
hypothesis pruning was found to be less than 10 msec. This short update time is due
to the limited number of observers and observed objects in our experiment. We
expect this time to grow drastically with an increasing number of observing and
observed robots. However, within a RoboCup scenario a natural upper bound
is imposed through the limited number of robots per team. A detailed analysis
of the hypothesis trees revealed that new tracks were initiated on only very few
occasions. All of them were pruned away within two iterations of the MHT.
Overall the observed track (see Fig. 5b, continuous line) diverges relatively little
from the real trajectory (dotted line).
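
A constant velocity Kalman filter with position-only measurements, as used per hypothesis in this experiment, can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the state layout (x, y, vx, vy), the time step `dt`, the discretization of the white noise acceleration, and the measurement covariance `R`, which in the system described here would come from the uncertainty estimation of Section 3.2.

```python
import numpy as np


def cv_matrices(dt, accel_var):
    """Constant-velocity model for state (x, y, vx, vy) with white-noise acceleration."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    G = np.array([[0.5 * dt**2, 0],
                  [0, 0.5 * dt**2],
                  [dt, 0],
                  [0, dt]], dtype=float)
    Q = accel_var * G @ G.T                         # process noise covariance
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)       # position-only measurement
    return F, Q, H


def predict(x, P, F, Q):
    """Time update: propagate state and covariance one step ahead."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z, R, H):
    """Measurement update with a position measurement z of covariance R."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

Within the MHT, `predict` would supply the predicted position that is compared with each new observation, and `update` would be applied in every child hypothesis that assigns the observation to the track.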

5 Conclusions 

In this paper, we have developed and analyzed a cooperative probabilistic, vision- 
based state estimation method for individual, autonomous robots. This method 
enables a team of mobile robots to estimate their joint positions in a known 
environment and track the positions of autonomously moving objects. 
Abstract. A complete processing chain for visual object recognition
is described in this paper. The system automatically detects individual 
objects on an assembly line, identifies their type, position, and orien- 
tation, and, thereby, forms the basis for automated object recognition 
and manipulation by single-arm robots. Two new ideas entered into 
the design of the recognition system. First we introduce a new fast 
and robust image segmentation algorithm that identifies objects in an 
unsupervised manner and describes them by a set of closed polygonal 
lines. Second we describe how to embed this object description into 
an object recognition process that classifies the objects by matching 
them to a given set of prototypes. Furthermore, the matching function 
allows us to determine the relative orientation and position of an 
object. Experimental results for a representative set of real-world tools 
demonstrate the quality and the practical applicability of our approach. 

Keywords: Object Recognition, Shape, Mesh Generation, Model Selec- 
tion, Robotics 

1 Introduction 

Object manipulation with assembly line robots is a relatively simple task if the 
exact type, position and spatial alignment of objects are fixed or at least known 
in advance. The difficulty of the problem increases significantly if the assembly 
line transports a large set of different types of objects, and if these objects
appear in arbitrary orientation. In this case, the successful application of robots 
crucially depends on reliable object recognition techniques that must fulfill the 
following important requirements: 

Reliability: The algorithm should be robust with respect to noise and image 
variations, i.e. its performance should not degrade in case of slight variations 
of the object shape, poor image quality, changes in the lighting conditions
etc. 

Speed: The online scenario of the application implies certain demands on the 
speed of the image recognition process. It must be able to provide the result 
of its computations in real time (compared to the assembly line speed) so 
that the robot can grasp its target from a slowly moving assembly line. 
Fig. 1. The process pipeline. The system is trained by processing a single picture of
each object and storing its contour in a database. By taking snapshots of the assembly 
line, it then automatically locates objects and describes their contours by a polygonal 
mesh. After matching them to the database objects, it is able to determine their type, 
position, and orientation, which are passed to the robot controller. 



Automation: The process should be completely automated, i.e., it should work 
reliably without any additional user interaction. 

In this paper, we describe a combination of algorithms that meet the above requirements
(fig. 1). It assumes that the robot is equipped with an image database
in which a variety of different objects is stored by single prototype images. Using
a camera which is mounted above the assembly line, the robot is expected to
locate and identify passing objects, to pick them up and to place them at their
final destination, thereby performing a task like sorting. The object recognition
strategy proposed here is a two-stage process. The first stage locates individual
objects and describes them by polygonal shapes. It is able to cope with almost
all kinds of shapes, including those with holes or strong concavities, which is an
important advantage over other contour descriptors such as many active contour
approaches [3]. The second stage matches these shapes to the prototypes, detects
the type of the objects, and computes their relative orientation.

Alternative approaches to image segmentation combined with polygonization
typically depend on an initial edge detection process and do not offer any
strategies to effectively control the precision of the triangular approximation
in a statistically well-motivated manner [?]. Other object vision systems for
assembly line robots require models of each object, which they iteratively fit to
3D information from the scene [?]. A similar approach was followed in [12],
where 2D contours were used to classify objects based upon p. 

2 Polygonal Image Segmentation 

Formally, we describe an image by a function $I(o)$ that assigns each possible
position $o_i$ a pixel value $I(o_i)$. In our current implementation, we operate only
on binary images, i.e. $I(o) \in \mathbb{B}$, but the theoretical framework also applies to
the multi-valued case. The image is assumed to consist of several areas $a_\lambda$, each
being characterized by a homogeneous distribution of pixel values. We aim at
finding a decomposition of the image $I(o)$ into segments $\hat{a}_\nu$ that are (at least
Fig. 2. (a) Binarized input image with superimposed initial mesh, (b) Early optimization
stage after few splits, (c) Final triangulation, (d) Corresponding polygonal shape.



after consistent renaming of their indices) in good accordance with the areas $a_\lambda$.
Thus the desired output of our algorithm is a function $\hat{a}(o)$ that, to the best
possible degree, mirrors the composite structure $a(o)$ of the image.

We assume that the image formation can be described by a simple generative
model: First an image site $o_i$ is selected according to a distribution $p(o_i)$ which
is assumed to be uniform over all pixel positions. The site $o_i$ is then assigned to
an area $a_\lambda = a(o_i)$. Depending exclusively on this area information, the image
site is finally provided with a pixel value $I_\mu = I(o_i)$ according to the conditional
distribution $p(I_\mu | a_\lambda)$. Replacing $a_\lambda$ by their estimates $\hat{a}_\nu$ therefore yields

$$p(o_i, I_\mu | \hat{a}_\nu) = p(o_i) \cdot p(I_\mu | a_\lambda) = p(o_i) \cdot p(I_\mu | \hat{a}_\nu) . \tag{1}$$

According to [?], the latent parameters $\hat{a}(o)$ should maximize the complete data
likelihood, which, when assuming statistically independent pixel positions, can
be written as $\mathcal{L} = \prod_i p(o_i, I(o_i), \hat{a}(o_i)) = \prod_i p(o_i, I(o_i) | \hat{a}(o_i)) \cdot p(\hat{a}(o_i))$.
Inserting (1) and dropping constant factors, we obtain

$$\mathcal{L} \propto \prod_{\mu,\nu} \left( p(I_\mu | \hat{a}_\nu) \, p(\hat{a}_\nu) \right)^{n(I_\mu, \hat{a}_\nu)} , \tag{2}$$

where $n(I_\mu, \hat{a}_\nu)$ denotes the number of times the pixel value $I_\mu$ is observed
in segment $\hat{a}_\nu$. The corresponding negative log likelihood per observation
is given by $-\frac{1}{n} \log \mathcal{L} \propto -\sum_{\mu,\nu} p(I_\mu, \hat{a}_\nu) \log \left( p(I_\mu | \hat{a}_\nu) \cdot p(\hat{a}_\nu) \right)$, where $p(I_\mu, \hat{a}_\nu)$
is the probability of a joint occurrence of $I_\mu$ and $\hat{a}_\nu$, and $n$ is the total number
of observations. $-\frac{1}{n} \log \mathcal{L}$ can be decomposed into two parts. The first part
is the conditional entropy of the pixel values $I(o_i)$ given their assignments to
polygons $\hat{a}(o_i)$. The second part is the entropy of the a-priori distribution for
the polygons, which is discarded to avoid any prior assumptions on the size of
individual polygons. We arrive at the cost function

$$H(\hat{a}) = -\sum_{\mu,\nu} p(I_\mu, \hat{a}_\nu) \log p(I_\mu | \hat{a}_\nu) , \tag{3}$$

which is insensitive with respect to consistent renaming of the polygon indices. It
can be shown to be minimal for perfect correspondences between the estimated
and the true segmentations $\hat{a}(o_i)$ and $a(o_i)$, respectively. Besides, it is concave
in $p(\hat{a}_\nu, a_\lambda)$, which has the advantage that there are no local minima inside the
probability simplex.
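
The cost function (3) can be evaluated directly from a table of co-occurrence counts of pixel values and segments. The following NumPy sketch illustrates this under the assumption that such a count table is available; the function name `conditional_entropy` and the table layout are hypothetical, and the natural logarithm is used.

```python
import numpy as np


def conditional_entropy(counts):
    """Cost (3): H = -sum over (mu, nu) of p(I_mu, a_nu) * log p(I_mu | a_nu).

    counts[mu, nu] is the number of pixels with value I_mu in segment a_nu.
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()                    # p(I_mu, a_nu)
    seg = joint.sum(axis=0, keepdims=True)           # p(a_nu)
    # p(I_mu | a_nu); empty segments contribute nothing to the sum
    cond = np.divide(joint, seg, out=np.zeros_like(joint), where=seg > 0)
    mask = joint > 0                                 # 0 * log 0 is taken as 0
    return float(-(joint[mask] * np.log(cond[mask])).sum())
```

For a binary image, a segmentation in which every segment contains only one pixel value gives a cost of zero, while segments in which both values are equally frequent give the maximal cost of log 2.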

We represent the segmentation $\hat{a}(o)$ by a triangular mesh [?], which is refined
by a hierarchical optimization strategy. Starting with 4 initial triangles,
we iteratively add new vertices to achieve a finer resolution. Once a new vertex
$v_\lambda$ has been added, it is moved to the position where it causes the minimal
partial costs with respect to (3). During this optimization, the movement of the
vertex has to be restricted to the convex polygon that is formed by the straight
lines connecting its adjacent vertices. Under certain preconditions, however, this
constraint can be circumvented by swapping individual edges, which gives the
algorithm additional flexibility. After the optimal position for the new vertex
has been found, all adjacent vertices are inserted into a queue from which they are
extracted for further optimization (fig. 2).

The described algorithm can be implemented as a multiscale variant by downsampling
the original image into several resolution levels, optimizing the mesh
on a coarse level, mapping the result onto the next finer level, and continuing the
optimization there. In this case, however, one has to detect at which granularity
the current image resolution is too low to justify any further mesh refinement.
For the definition of an appropriate decision criterion, it is important to note
that the cost function (3) does not take the exact position of individual pixels
into account. Instead, it completely relies on the joint distribution of grey values
and image segments. From Sanov’s theorem [?], one can thus infer that the
probability of measuring a certain cost value $H^*$ is completely determined by
the minimal KL-distance between the generating model $p(I_\mu, a_\lambda)$ and the set of
all empirical probability distributions $q(I_\mu, a_\lambda)$ for which the cost value $H^*$ is
obtained. It can be shown that, among these distributions, the one with minimal
KL divergence has the parametric form

$$q^*(I_\mu, a_\lambda) \propto p(a_\lambda) \cdot p(I_\mu | a_\lambda)^\beta . \tag{4}$$

The correct value of $\beta$ can easily be found by an iterated interval bisection
algorithm. According to Sanov’s theorem, this leads to the probability estimate

$$\Pr\{H = H^*\} \approx \exp\left(-n \, D(q^* \,\|\, p)\right) , \tag{5}$$

where $D(q^* \,\|\, p)$ denotes the KL divergence between $q^*$ and the generating model $p$.
It can be used to compute the probability that a previous model generates an
image with the same costs as the costs measured for the actual image model. If
this probability is above a threshold, the optimization algorithm decides
that the optimization progress might also be due to noise, and switches to the
next finer resolution level in which the respective grey value distributions can
be estimated with higher reliability. If it already operates on the finest level,
adjacent triangles that share their dominant grey value are fused into larger
polygons, which are then passed to the shape matching algorithm.

A New Contour-Based Approach to Object Recognition 333

Fig. 3. (a), (b): Stapler segmented with thresholds of 0.75 and 0.99, respectively, (c)
Close-up of the fitted object boundary, (d) Result at a noise level of 40%.

3 Application to Object Recognition

For the shape matching, we employ the algorithm described in [?], which has
been found to feature both high accuracy and noise robustness. It first maps
the shapes onto their normalized tangent space representations, which has the
advantage of being invariant with regard to position and scale. After smoothing
the shapes by an appropriate shape evolution process they are divided into 
their maximal convex and concave sub-arcs. This is motivated by the fact that 
visual object parts relate to the convex sub-arcs of the object shape. Based on 
this decomposition of shapes the similarity of two shapes is defined as a weighted 
sum over a set of many-to-one and one-to-many matching of consecutive convex 
and concave sub-arcs. 
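
The tangent space representation used in the first step can be illustrated with a short sketch. The Python code below is a minimal illustration under our own assumptions, not the implementation of the matching algorithm: the function names `tangent_space` and `turning_angle` are hypothetical, the polygon is assumed to be closed and given as an ordered vertex list, and the arc length is normalized to 1. It maps a polygon to a step function of edge direction over normalized arc length, which does not change when the polygon is translated or scaled. 

```python
import math

def turning_angle(e1, e2):
    """Signed angle from edge vector e1 to e2 in (-pi, pi]; positive means a left turn."""
    cross = e1[0] * e2[1] - e1[1] * e2[0]
    dot = e1[0] * e2[0] + e1[1] * e2[1]
    return math.atan2(cross, dot)

def tangent_space(polygon):
    """Step function of a closed polygon in tangent space.

    polygon: ordered list of (x, y) vertices; the closing edge is implied.
    Returns one (normalized arc length at edge end, accumulated direction) pair
    per edge. Directions are accumulated turning angles, so they do not wrap.
    """
    n = len(polygon)
    edges = [(polygon[(i + 1) % n][0] - polygon[i][0],
              polygon[(i + 1) % n][1] - polygon[i][1]) for i in range(n)]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    total = sum(lengths)  # dividing by the perimeter removes the scale
    steps = []
    angle = math.atan2(edges[0][1], edges[0][0])
    travelled = 0.0
    for i in range(n):
        if i > 0:
            angle += turning_angle(edges[i - 1], edges[i])
        travelled += lengths[i]
        steps.append((travelled / total, angle))
    return steps
```

The sign of `turning_angle` at a vertex separates left turns from right turns, so runs of equal sign give candidate convex and concave sub-arcs; which sign corresponds to convex depends on the orientation of the contour. 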

With this similarity measure, the type of a query object can be determined by 
retrieving the most similar prototype in the image database. In order to facilitate 
the subsequent grasp operation, the database also contains an adequate gripper 
position (grasp point) and orientation for each prototype (Fig. 4a). To initiate the 
grasping of the query object, we thus have to compute its rotation and translation 
parameters with respect to its prototype. Here we can take advantage of the fact 
that, to find the optimal matching between two polygonal sub-arcs $a_i$ and $b_i$ in 
tangent space, the shape matching algorithm implicitly applies a rotation by the 
angle $\varphi_i$ which minimizes the squared Euclidean distance 
$\int (a_i(t) - b_i(t) + \varphi_i)^2 \, dt$ between $a_i$ and $b_i$. The shape similarity 
measure ignores the angles $\varphi_i$ and is thus able to handle rigid object rotations 
and flexible joints. Here, however, we propose to compute the $\alpha$-truncated mean 
of the $\varphi_i$, which gives us a robust estimate $\varphi$ of the relative orientation of 
the whole contour. To localize the grasp point on the query shape, the boundary 
positions of the convex and concave sub-arcs are used as reference points 
(Fig. 4b). Let $p_d$ denote the grasp point for the database object, $x_i$ the 
associated reference points, and $y_i$ the reference points on the query shape. In 
addition, define $\delta_i$ as the Euclidean distance between $x_i$ and $p_d$, and $d_i$ as 
the corresponding distance between $y_i$ and the current grasp position on the 
query shape, $p_q$. Our approach is to choose $p_q$ such that the $d_i$ give the best 
fit to the corresponding $\delta_i$. This problem setting is similar to the problem of 
multidimensional scaling (MDS), in which one is given pairwise dissimilarity 
values for a set of objects, which have to be embedded in a low-dimensional 
space. To find $p_q$, we therefore adapt the SSTRESS objective function for MDS 
and minimize $J = \sum_i (d_i^2 - \delta_i^2)^2$, which can be achieved by a standard 
gradient descent algorithm. Note, however, that each $d_i$ corresponds to a partial 
contour match and therefore to a rotation angle $\varphi_i$. If $\varphi_i$ is significantly 
different from the rotation angle $\varphi$ of the whole object, then the corresponding 
$d_i$ should not exert a strong influence on the cost function. We therefore 
downweight each term in J by $\exp(-\lambda \sin^2(\Delta\varphi_i / 2))$ with 
$\Delta\varphi_i := \varphi_i - \varphi$. In addition, it is possible to sample further positions 
along the sub-arcs to obtain additional reference points, which increases the 
robustness of the algorithm. 
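
The orientation estimate and the grasp point localization described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the function names, the weighting constant `lam`, the step size `rate`, the iteration count and the centroid initialization are all hypothetical choices, the truncated mean is assumed to discard a fraction `alpha` of the values at each end, and angles are averaged without wrap-around handling. The objective is the weighted sum $J = \sum_i w_i (d_i^2 - \delta_i^2)^2$ with $w_i = \exp(-\lambda \sin^2(\Delta\varphi_i/2))$. 

```python
import math

def truncated_mean(values, alpha=0.2):
    """Mean after discarding the fraction alpha of smallest and of largest values."""
    ordered = sorted(values)
    k = int(alpha * len(ordered))  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return sum(kept) / len(kept)

def locate_grasp_point(refs, deltas, phis, lam=5.0, alpha=0.2, rate=1e-3, steps=2000):
    """Gradient descent on J = sum_i w_i * (d_i^2 - delta_i^2)^2.

    refs:   reference points y_i on the query shape, as (x, y) tuples
    deltas: distances delta_i between database reference points and grasp point
    phis:   rotation angles phi_i of the partial contour matches (radians)
    Returns the estimated grasp point and the robust orientation estimate.
    """
    phi = truncated_mean(phis, alpha)
    # Matches whose angle deviates from the global orientation get a small weight.
    weights = [math.exp(-lam * math.sin((p - phi) / 2.0) ** 2) for p in phis]
    px = sum(x for x, _ in refs) / len(refs)  # assumed start: centroid of the y_i
    py = sum(y for _, y in refs) / len(refs)
    for _ in range(steps):
        gx = gy = 0.0
        for (x, y), delta, w in zip(refs, deltas, weights):
            residual = (px - x) ** 2 + (py - y) ** 2 - delta ** 2
            gx += 4.0 * w * residual * (px - x)  # dJ/dpx
            gy += 4.0 * w * residual * (py - y)  # dJ/dpy
        px -= rate * gx
        py -= rate * gy
    return (px, py), phi
```

A fixed step size is used only for brevity; on real contours the step size would have to be adapted to the scale of the coordinates. 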

Fig. 4. (a) To adjust the position and orientation of the two-finger gripper, the system 
adapts the specifications for the prototype in the database. (b) The grasp points of 
database (left) and query (right) objects are defined by their relative distances 
from a set of control points, which are by-products of the shape matching algorithm. 





Fig. 5. (a) Segmentation with 3 quantization levels, run on a dithered camera image, 
(b) Segmentation with a textured background, demonstrating that the segmentation 
algorithm is not restricted to high contrast situations. Here a Floyd-Steinberg dithering 
with 32 quantization levels was used for preprocessing. 



4 Experimental Results and Discussion 

For a first evaluation of the system, a set of 6 different tools was chosen from a 
standard toolbox, including two rather similar saws, two wrenches, an Allen key, 
and a stapler. From each object, 10 images were taken that showed the object in 
different positions and orientations. The images were binarized using a variant 
of the Lloyd-Max algorithm. From each object, we randomly selected one 
image for the prototype database, while the remaining images were joined to form 
the validation set. The performance was evaluated using 10-fold cross-validation. 

In this scenario, we obtained an object recognition performance of 99.6% with 
a standard deviation of 0.8%. We also measured the precision of the orientation 
estimation and found an average error of 2.2° with a standard deviation of 
3.2° compared to visual inspection. The average error for the grasp point 
estimate was 5 pixels. The polygonization runs on a 1 GHz PC on images with 
a resolution of 256x256 pixels in less than five seconds. There are still several 
possibilities to speed it up significantly, namely stopping the multiscale 
optimization at a lower stage, finding a better initial mesh, and working with 
areas of interest. 

The parameter has been found to be a powerful tool for controlling 
the accuracy with which the mesh is fitted to the object. See Fig. 3 (a), (b) for 
two results with different values, and Fig. 3 (c) for a close-up view of a 
generated contour. The availability of a reliable tool like this is important to cut 
down the computational complexity, because the object should not be described 
with a higher precision than necessary. An overly complicated contour has to be 
simplified afterwards by a curve evolution approach. Besides, 
the algorithm could be shown to remain very robust in the presence of noise. 
Fig. 3 (d) shows a test image in which 40% of the pixels had been set to random 
values. It demonstrates that the segmentation result remains nearly unaffected 
by this noise. 

With slight modifications, the segmentation algorithm is also capable of 
processing non-monochrome objects or objects on textured backgrounds (Fig. 5). 
Instead of the Lloyd-Max quantization one can alternatively use the 
Floyd-Steinberg dithering algorithm, which has the additional advantage of avoiding 
undesirable edges at the boundaries of homogeneously quantized image regions. 



5 Conclusion and Future Work 

We have presented a new computer vision method for analyzing objects on an 
assembly line. To automatically extract all information that is necessary to grasp 
and e.g. sort the objects by a robot, it employs a new fast and robust image 
segmentation algorithm that locates the objects and describes them by sets of 
polygons. These object representations are then used to compare the objects to a 
given set of prototypes, to recognize their type, and also to compute their exact 
position and orientation. Real-world experiments show that the proposed object 
recognition strategy produces high-quality results and matches all demands of 
the application (in terms of speed, robustness, and automation). 

Our framework offers several promising starting points for further extensions. 
In our current implementation, we restrict ourselves to Bernoulli distributions 
(i.e. binarized images). The generative model, however, is valid for arbitrary 
discrete probability distributions, and can thus also be applied to images where a 
larger number of possible grey values is retained. When applied to these images, 
the segmentation algorithm is able to describe individual objects as structures 
with several adjacent or even nested parts. Although we currently use only the 
outer contour of an object as its descriptor, we expect that there is a generic 
extension of Latecki’s shape similarity concept to compound objects, which will 
be in the focus of our future research, as will speed improvements and the 
dependence of the polygonization performance on the parameter. 

Further work will also include the integration of the algorithms into an 
operational automated production system. 

Abstract. In this paper, knowledge-based recognition of objects in a bureau 
scene is studied and compared using two different systems on a common data 
set: In the first system active scene exploration is based on semantic networks 
and an A*-control algorithm which uses color cues and 2-d image segmentation 
into regions. The other system is based on production nets and uses line 
extraction and views of 3-d polyhedral models. For the latter a new probabilistic 
foundation is given. In the experiments, wide-angle overviews are used to 
generate hypotheses. The active component then takes close-up views which are 
verified exploiting the knowledge bases, i.e. either the semantic network or the 
production net. 



1 Introduction 

Object localization from intensity images has a long history of research, but has not 
led to a general solution yet. Approaches proposed differ in objectives, exploited 
features, constraints, precision, reliability, processing time etc. Although surveys exist 
on knowledge-based object recognition [8, 6, 1], little has been published on 
experiments by different groups on a common task. Comparisons mainly exist on 
data-driven or appearance-based approaches, e.g. on the COIL-data base [7]. We 
compare two different approaches developed by different groups to solve one 
common task. We chose the localization of a hole punch from oblique views on an 
office desk. Fig. 3a,f,g,h below show such frames taken with different focal lengths 
by a video camera. 

In the experiments camera parameters (focal length, pan, tilt) are adjustable and 
camera actions are controlled by the recognition process. The 3-d position of the hole 
punch is constrained by a desk. The rotation is restricted to the axis perpendicular to 
the ground plate (azimuth). The overviews are used to generate hypotheses of the hole 
punch's position which result in camera actions to take close-up views. These are the 
input for final recognition or verification. 

In Sect. 2 we outline the structure and interaction of the two systems and present a 
new probabilistic foundation of the production net system. Results of experiments on 
a common data-base are given in Sect. 3. In Sect. 4 a discussion of pros and cons of 
both approaches is given. 



2 Architectures of the Two Systems 

Initially we describe how the two systems interact on common data. 




Fig. 1. Overview of the experimental localization setup with semantic network system (SN) and 
production net system (PN) 



Fig. 1 shows the different components of the two localization systems and the data 
flow between them. Starting with an overview color-image (like the one presented in 
Fig. 3a) two different algorithms are applied that generate hypotheses for the hole 
punch's location. The first system uses a pixel-based color classifier resulting in an 
interest map (Sect. 2.1), whereas the second system determines the hypotheses with a 
knowledge-based approach (Sect. 2.2). Both detection systems provide hypotheses as 
2-d-coordinates in the image and an assessment value. 

Based on the hypotheses, close-up views are then generated by adjusting pan and 
tilt and increasing the focal length of the active camera. Since close-up views contain 
objects in more detail, recognition is expected to be more reliable. Results of region 
segmentation are interpreted by the SN-system providing the center of gravity for 
each hypothesis. Lines constitute the input for the PN-system yielding a 3-d pose 
estimate for each hypothesis. This verification utilizes a different parameter setting 
and finer model compared to the detection phase. Both systems give an assessment 
value for the results. 

2.1 Localization with the Semantic Network System (SN) 

The semantic network approach uses the object’s color as a cue to hypothesize 
possible positions of the hole punch in a scene. An interest operator based on 
histogram back-projection is applied [13], which learns the color distribution of the 
hole punch and applies this distribution to find pixels of similar color in the overview 
images. We calculate the color histograms in the normalized rg color space to be 
insensitive to illumination changes. Since the hole punch is red, the interest operator 
yields hypotheses for red objects. 
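
The colour cue described in this paragraph can be sketched as follows. The code is a simplified illustration under our own assumptions and not the interest operator of [13]: it looks up a learned chromaticity histogram directly instead of using any image-dependent normalization, the bin count of 8 is arbitrary, and the names `rg_bin`, `learn_histogram` and `back_project` are hypothetical. Pixels are (R, G, B) tuples. 

```python
def rg_bin(pixel, bins=8):
    """Histogram bin of a pixel in normalized rg space (r = R/(R+G+B), g = G/(R+G+B))."""
    r, g, b = pixel
    s = r + g + b
    if s == 0:
        return (0, 0)  # assumption: black pixels are assigned to bin (0, 0)
    return (min(int(r / s * bins), bins - 1), min(int(g / s * bins), bins - 1))

def learn_histogram(object_pixels, bins=8):
    """Relative frequency of each rg bin among the training pixels of the object."""
    counts = {}
    for p in object_pixels:
        key = rg_bin(p, bins)
        counts[key] = counts.get(key, 0) + 1
    n = len(object_pixels)
    return {key: c / n for key, c in counts.items()}

def back_project(image, hist, bins=8):
    """Interest map: every pixel receives the learned frequency of its own rg bin."""
    return [[hist.get(rg_bin(p, bins), 0.0) for p in row] for row in image]
```

Because only the ratios r and g enter the histogram, a darker or brighter view of the same red surface falls into the same bin, which is the illumination insensitivity mentioned above. 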

The verification of the hypotheses is done by matching the hole punch’s model to 
color regions. These regions are determined by segmenting the close-up views using a 
split and merge approach. The semantic network represents the 2-d object model by a 
concept which is linked to a color region concept [2]. The latter concept contains 
attributes for the region’s height, width, and color as well as the allowed value range 
for each of these attributes. During analysis the expected values for the object are 
compared to the corresponding feature values calculated for each color region of the 
close-up views. A match is judged according to a probability-based optimality 
criterion [3]. The search for the best matching region is embedded into an A*-search. 



2.2 Localization with the Production Net System (PN) 

Production nets [12] have been described for different tasks like recognition of roads 
in maps and aerial images, 3D reconstruction of buildings, and vehicle detection. A 
syntactic foundation using coordinate grammars is given in [9]. Initially, contours are 
extracted from gray-value images and approximated by straight line segments. The 
production system works on the set of lines and reconstructs the model structure of 
the hole punch according to the production net. This search process is 
performed with a bottom-up strategy. Accumulating irrevocable control and 
associative access are used to reduce the computational load [11, 9]. 

The view-based localization utilized here for the hole punch search implements 
accumulation of evidence by means of cycles in the net with recursive productions 
[10]. The accumulation resembles the generalized Hough transform [4]. The hole punch 
is modeled by a 3-d polyhedron. The 3-d pose space is sampled equidistantly by rotating 
the object in azimuth α in steps of 10° and varying the distance d in 5 steps of 10 cm. 
For each of these 180 poses a 2-d model is automatically generated off-line by a 
hidden line projection assuming perspective projection and internal camera 
parameters estimated by previous calibration (see Fig. 2). The recognition relies on 
matching structures formed of straight line segments. Below, only L-shaped structures 
are used, which carry four attributes: the location of the vertex and the two orientations. 
If an L-structure in the image is constructed from two lines, then similar L-structures 
are searched in each 2-d model, where the two orientations account for the similarity. 
Matches are inserted as cue instances into a 4-d accumulator of position in the image 
(x, y), azimuth α, and distance d. Accumulation is performed by recursive productions 
operating on the associative memory, which is discretized in pixels, 1°, and 1 cm. Values 
found in the accumulator highly depend on structures and parameters. High values 
indicate the presence of the object for a corresponding pose. 
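
The pose sampling and the voting into the 4-d accumulator can be illustrated by the following sketch. It replaces the recursive productions of the production net by a plain nested loop and is therefore only a simplified stand-in, not the PN-system itself. The distance values, the orientation tolerance `tol_deg` and all function names are assumptions made for the example; L-structures are written as tuples (x, y, o1, o2) with orientations in degrees. 

```python
from collections import Counter

def sample_poses(azimuth_step=10, distances_cm=(100, 110, 120, 130, 140)):
    """36 azimuth steps times 5 distances give the 180 poses; distances are assumed."""
    return [(az, d) for az in range(0, 360, azimuth_step) for d in distances_cm]

def angle_diff(a, b):
    """Difference of two line orientations, taken modulo 180 degrees (assumption)."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def vote(image_ls, models, tol_deg=5.0):
    """Accumulate cues (x, y, azimuth, distance) from matching L-structures.

    image_ls: L-structures found in the image, as (x, y, o1, o2)
    models:   {(azimuth, distance): [(mx, my, o1, o2), ...]}, one 2-d model per pose,
              with vertex positions relative to the model reference point
    """
    acc = Counter()
    for x, y, o1, o2 in image_ls:
        for (azimuth, dist), model_ls in models.items():
            for mx, my, m1, m2 in model_ls:
                if angle_diff(o1, m1) <= tol_deg and angle_diff(o2, m2) <= tol_deg:
                    # Each match votes for the image position of the model reference.
                    acc[(round(x - mx), round(y - my), azimuth, dist)] += 1
    return acc
```

Calling `vote(...).most_common(1)` returns the accumulator cell with the highest count, which corresponds to the plain Hough-style decision before any probabilistic assessment. 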

Fig. 2. Selected set of 2-d models projected from a 3-d polyhedron model (Δα = 15°) 

We now replace the accumulator values by an objective function based on 
probabilistic assessment. For this purpose we modified the theory derived by Wells 
[14]. But while he uses contour primitives attributed by their location, orientation and 
curvature our approach matches L-structures; while he operates in the image domain 
we operate in the accumulator. 



Wells' Theory of Feature-Based Object Recognition 

Wells uses a linear pose vector $\beta$ of dimension 4 (for similarity transform) or 8 (for 
limited 3-d rotations according to the linear combination of views method), and a 
correspondence function $\Gamma$, mapping the image features to a model feature or to the 
background. A scaled likelihood 

$$ L(\Gamma, \beta) = -\tfrac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0) + \sum_{i,j:\, \Gamma_i = M_j} \Bigl[ \lambda - \tfrac{1}{2} (Y_i - M_j \beta)^T \psi_{ij}^{-1} (Y_i - M_j \beta) \Bigr] \qquad (1) $$

of an image-to-model correspondence and a pose is derived from independence and 
distribution assumptions. The first term in Eq. 1 results from a normal prior 
distribution on the pose, where $\psi_\beta$ is the corresponding covariance matrix and 
$\beta_0$ the center. The second term is due to the conditional probability that a set of 
correspondences between the image features $Y_i$ and model features $M_j$ may be true, 
given a pose $\beta$. Wells gives the model features $M_j$ in a matrix format that enables 
linear transformation to the image feature domain. Inside the sum there appears a 
trade-off rewarding each image-to-model correspondence by a constant $\lambda$ and 
punishing the match errors. The punishing term for each correspondence results from 
the assumption of linear projection and normally distributed error in the mapping of 
object to image features with covariance $\psi_{ij}$. To reduce the complexity of the 
estimation process, this matrix is taken to be independent of the indices i and j. The 
reward term $\lambda$ is to be calculated from a representative training set according to 

$$ \lambda = \ln \left( \frac{1}{(2\pi)^{v/2}} \cdot \frac{1 - B}{m\,B} \cdot \frac{W_1 \cdots W_v}{\sqrt{|\psi|}} \right) . \qquad (2) $$

The middle factor in this product is calculated from the ratio between the probability 
B that a feature is due to the background, and the probability (1-B)/m that it 
corresponds to a certain model feature, where m is the number of features in the 
model. The rightmost factor in the product is given by the ratio between the volume 
of the whole feature domain $W_1 \cdots W_v$ and the volume of a standard deviation ellipsoid 
of $\psi$. 

Modification for Accumulator-Productions in the PN-System 

To apply the theory of Wells to our problem we set $\beta = (x, y, \alpha, d)$. The objective 
function L is calculated for each cluster of cues. The pose $\beta$ is estimated as the mean 
$(\bar{x}, \bar{y}, \bar{\alpha}, \bar{d})$ of the poses of the member cues of the cluster. The 
correspondence $\Gamma$ is coded as an attribute of the cues. For each model feature j put into 
correspondence in the cluster, the cue i closest to the mean is taken as representative of 
the set of all cues i corresponding to j. This is done because we regard multiple cues 
to the same model feature as not being mutually independent. The attribute values 
$(x_i, y_i, \alpha_i, d_i)$ directly serve as $Y_i$ for formula (1). There is no need for coding model 
features in a matrix format, because the projection has been treated off-line in the 
generation of the 2-d models. We just determine the deviation for each such cue: 

$$ L = \sum_{j} \Bigl[ \lambda - \tfrac{1}{2} \min_{i:\, \Gamma_i = j} (Y_i - \beta)^T \psi^{-1} (Y_i - \beta) \Bigr] \qquad (3) $$

The covariance matrix $\psi$ of the cues and the background probability B are estimated 
from the training set. In the present bureau application these differ significantly 
between overviews and close-ups. For the overviews the reward $\lambda$ is close to the 
critical value zero, indicating that recognition in these data is difficult and not very 
stable. Recall that the maximization must not take into account those $\Gamma$ that 
introduce negative terms into the sum. This condition gives a new way to infer the threshold 
parameters for adjacency in the cluster productions from a training set. In the 
verification step, parameters are set differently from the detection step, e.g. the 
accumulator is now sampled in Δα = 5° and Δd = 5 cm. Fig. 2 shows 2-d models used 
for close-up views, whereas Fig. 3d,e show two coarser 2-d models used for the 
overviews. 

The theory of Wells rejects scenes as non-recognizable if $\lambda$ turns out to be 
negative according to Eq. 2. In such a situation we may still use a positive reward $\lambda'$ 
instead, indicating that cues with high values for this objective function will, with 
high probability, contain more false matches than correct ones. Still, among the set of all 
cues exceeding a threshold, there will be the correct hypothesis with a probability that 
may be calculated from the difference $\lambda - \lambda'$. 

For the close-ups an ML-decision is needed and we have to use the correct reward 
term $\lambda$. For these data the estimate for $\lambda$ is much bigger. Compared to the 
Hough-accumulator value used as decision criterion in [10], the likelihood function includes 
an error measurement on the structures set in correspondence with the model and 
evaluates the match based on an estimated background probability. 
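
To make the probabilistic assessment concrete, the following sketch computes a reward term and a cluster score. It is an illustration under explicit assumptions, not the implementation of the PN-system: the covariance of the cues is assumed to be diagonal, the reward is assumed to be the logarithm of the product of the three factors described for Eq. 2, the prior term of Eq. 1 is omitted, and correspondences with a negative contribution are simply left out of the sum. The function names `reward` and `cluster_score` are hypothetical. 

```python
import math

def reward(background_prob, num_model_features, domain_extents, sigmas):
    """Reward per correspondence for a diagonal covariance diag(sigma_k^2).

    Assumed form: ln( (2 pi)^(-v/2) * (1 - B) / (m B) * W_1...W_v / sqrt|psi| ).
    """
    v = len(domain_extents)
    volume = math.prod(domain_extents)  # volume of the whole feature domain
    sqrt_det = math.prod(sigmas)        # sqrt|psi| for a diagonal psi
    ratio = (1.0 - background_prob) / (num_model_features * background_prob)
    return math.log(ratio * volume / ((2.0 * math.pi) ** (v / 2.0) * sqrt_det))

def cluster_score(cues, lam, sigmas):
    """Score of one cue cluster.

    cues: {model feature j: [cue, ...]}, each cue being a tuple (x, y, alpha, d).
    The pose is the mean of all member cues; per model feature only the cue
    closest to that mean (smallest normalized squared deviation) is counted.
    """
    members = [c for cue_list in cues.values() for c in cue_list]
    dims = range(len(sigmas))
    mean = [sum(c[k] for c in members) / len(members) for k in dims]

    def deviation(c):
        return sum(((c[k] - mean[k]) / sigmas[k]) ** 2 for k in dims)

    score = 0.0
    for cue_list in cues.values():
        term = lam - 0.5 * min(deviation(c) for c in cue_list)
        if term > 0.0:  # negative terms are not admitted into the sum
            score += term
    return score, mean
```

With a larger background probability the reward shrinks and eventually becomes negative, which mirrors the observation that the reward is close to zero for the overview images. 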



3 Experiments 

Each system used its own set of training images for hypothesis generation and object 
recognition. The training of the SN-system is based on 40 close-up images for model 
parameter estimation and a histogram for red objects that is calculated using one 
close-up image of the hole punch. For the PN-system 7 desk scene overview and 7 
close-up images were used as training set. The evaluation was done on 11 common 
scenes disjoint from the training sets. One of these overviews is depicted in Fig. 3a. 
For each test scene both systems generated their hypotheses (Fig. 3b, c), and 
corresponding close-up sets were taken by the scene exploration system. 

Success and failure were judged manually. In Fig. 3 the highest objective function 
value L is detected by the PN-system in the correct location. Fig. 3d presents the 2-d 
model corresponding to this result. The pose is incorrectly determined on this image. 
Fig. 3e shows the 2-d model of a cue cluster with correct pose and location but a 
smaller likelihood. On the 11 overview images only two localization results are 
successful, one of which also gives the correct pose. This shows that in this case pure ML 
is not sufficient. Therefore clusters are sorted according to L, and the clusters with the 
highest 1‰ of L scores were taken as hypotheses (see white crosses in Fig. 3b). A successful 
localization according to the definition is contained in 5 of the 11 hypothesis sets. 

The color-based detection of the SN-system does not determine pose. It gives 8 
correct ML-localization results in the overview images. In 10 results the hypotheses 
set contains the correct cue. Fig. 3c shows an interest-map of the overview image. 
Dark regions correspond to high likelihood of the hole punch's position. Note that 
hypotheses sets of the two systems differ substantially, as can be seen comparing Fig. 
3b and Fig. 3c. Where the SN-system finds red objects like the glue stick and the 
adhesive tape dispenser, the PN-system finds rectilinear structures like books. 

Fig. 3f,g,h show the three close-up views taken according to the PN-system 
detection. In the verification step the ML-decision includes all cues from a close-up 
set resulting from one overview. Fig. 3i,j,k display the result, where the third scores 
correctly the highest. A successful verification with the PN-system additionally 
requires the correct pose. This is performed correctly on 3 of the 11 examples. The 
SN-system succeeds on 9 close-up sets without giving a pose. The PN-system 
succeeds on one of the two failure examples of the SN-system. 



4 Discussion 

In this contribution we demonstrated how the difficult problem of object recognition 
can be solved for a specific task. An office tool is found by two model-based active 
vision systems. In both cases the object was modeled manually in a knowledge base. 
A 3-d polyhedral model was used in the PN-system requiring line segmentation for 
the recognition. 2-d object views were modeled in the SN-system using a region-based 
approach. The experiments revealed that color interest maps of the SN-system 
outperform the line-based hypothesis generation of the PN-system on the considered 
scenery. We conjecture that this is due to high color saturation and small size of the 
object in the overview images. Close-up views captured by the active camera increase 
the recognition stability of both systems; in some cases overview images already 
yielded the correct result. 




Where Is the Hole Punch? Object Localization Capabilities 



[Fig. 3 panel labels: Camera action; Verification PN; Detection PN; L=088; L=100] 



Fig. 3. Localization of the hole punch in a bureau scene; close-up views and verification of the 
SN-system omitted 



For the line-based recognition the process had to be parameterized differently for 
overview and close-up images. A new probabilistic objective function for the 
PN-system allows parameter inference from a training set, and opens the way for a better 
interpretation of the results. Both systems achieved recognition rates that, with 
respect to the complexity of the task, were satisfactory. It is expected that the 
combination of line and color segmentation will eventually outperform either 
approach. This is subject to future work. 

The PN-system is designed to work on T-shaped structures as well. Other 
possibilities like U-shaped structures would be a straightforward extension. Further 
investigations will include an EM-type optimization of pose and correspondence in 
the final verification step, also following [14]. 

We showed that one common experimental set-up can be used by two different 
working groups to generate competitive hypotheses and to verify these hypotheses, 
even in an active vision system. The image data is publicly available on the authors' 
web site to allow further comparisons by other groups. 

Abstract. By interferometric SAR measurements, digital elevation models 
(DEM) of large areas can be acquired in a short time. Due to the sensitivity 
of the interferometric phase to noise, the accuracy of the DEM 
depends on the signal to noise ratio (SNR). Usually the disturbed 
elevation data are restored employing statistical modeling of sensor and 
scene. But in undulated terrain layover and shadowing phenomena occur. 
Furthermore, especially in urban areas, additional effects caused by 
multi-bounce signals and the presence of dominant scatterers have to 
be considered. Unfortunately, these phenomena cannot be described in 
a mathematically closed form. On the other hand it is possible to exploit 
them in model-based image analysis approaches. In this paper we propose 
a method for the segmentation and reconstruction of buildings in 
InSAR data, considering the typical appearance of buildings in the data. 



1 Introduction 

The improved ground resolution of SAR suggests employing this sensor for the 
analysis of man-made objects. But, due to specific SAR effects, this imagery 
is difficult to interpret. Particularly in urban areas, phenomena like layover, 
shadow, multi-path signals and speckle have to be considered. 

The phase of interferometric SAR (InSAR) depends on the elevation of objects 
in the scene. Especially in areas with low SNR, i.e. poor coherence, 
the phase information is disturbed. Hence, the noise component has to be 
removed or at least reduced before further analysis. Often, the noise is decreased 
by averaging, e.g. low-pass filtering or multi-look processing. However, this leads 
to reduced spatial resolution and blurred phase jumps at object boundaries. 
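
The trade-off named in the last two sentences can be shown with a toy example. The sketch below is an assumption-laden illustration, not part of the proposed method: it averages non-overlapping groups of complex interferogram samples along one dimension only, and the function name `multilook` is hypothetical. 

```python
import cmath

def multilook(samples, looks):
    """Average non-overlapping groups of `looks` complex samples (1-d for brevity)."""
    return [sum(samples[i:i + looks]) / looks
            for i in range(0, len(samples) - looks + 1, looks)]

# A noise-free phase jump from 0 to 1 rad, e.g. at a building wall.
phases = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
interferogram = [cmath.exp(1j * p) for p in phases]
averaged = multilook(interferogram, 2)
```

Six samples are reduced to three, and the group that straddles the jump receives an intermediate phase of about 0.5 rad, so the sharp step is smeared although the input contained no noise at all. 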

Other methods are based on statistical modeling of sensor and scene. With 
Bayesian inference the data is restored to the configuration which most probably 
caused the measured data. Scene and sensor are modeled according to the 
central limit theorem, which requires a large number of independent scatterers 
per resolution cell, each contributing only a small share to the measured signal. 
In urban areas these assumptions no longer hold, because 
of the presence of dominant scatterers and the privileged rectangular object 
alignment. Furthermore, the number of scatterers per resolution cell decreases 
with growing resolution of the sensors. Thus, in urban areas the data cannot be 
restored without additional knowledge about the scene. 

Recently, approaches for building reconstruction in InSAR elevation data 
were proposed, which either apply model-based machine-vision methods or 
take typical phenomena like layover and shadowing into account. These methods 
are based on the analysis of the geocoded DEM alone. In the approach presented 
here, the entire set of complex InSAR information (phase, intensity and coherence) 
is analyzed for the segmentation of extended buildings. 

2 SAR and InSAR Principle 

2.1 Synthetic Aperture Radar (SAR) Principle 

An air- or spaceborne sensor illuminates the scene with radar pulses, and the runtime 
of the returned signal is measured. Since the antenna footprint on the ground 
is large, a side-looking viewing geometry is required to determine the distance 
between objects in the scene and the sensor from the runtime. The signal is partly 
either reflected away from the sensor or scattered towards the sensor, depending 
on the roughness of the scene compared to the signal wavelength (e.g. X-band: 
3 cm, P-band: 64 cm). The object positions in the azimuth direction are obtained 
by coherently integrating the signal of many pulses along the trajectory of the 
carrier (synthetic aperture). The resolution of SAR is a function of the impulse 
bandwidth in range and the length of the synthetic aperture in azimuth. 

The illumination geometry and the coherent measurement principle give rise 
to some phenomena (Fig. 1). Layover always occurs at vertical building walls 
facing towards the sensor. This leads to a mixture of signal contributions from 
the building and the ground in the SAR image. On the other side the building 
casts a shadow which occludes smaller objects behind it. The height of a detached 
building can be derived from the shadow length and the viewing angle θ. 
Multi-bounce signals between building walls and the ground or at corner structures 
lead to strong signal responses. Sloped rooftops which are oriented perpendicular 
towards the sensor cause very strong reflections as well. 

2.2 InSAR Principle 

SAR interferometry takes benefit from the coherent SAR measurement principle. 
Fig. 2 illustrates the principle of airborne single-pass across-track interferometry 
measurement. Two antennas are mounted above each other on the carrier with 
geometric displacement B. One of the antennas illuminates the scene and both 
antennas receive the backscattered complex signals. 

Fig. 1. SAR phenomena at buildings 

Fig. 2. Geometry of InSAR measurement 

The interferogram S is calculated by a pixel-by-pixel complex multiplication 
of the master signal $s_1$ with the complex conjugate of the slave signal $s_2$. Due to the 
baseline B, the distances from the antennas to the scene differ by $\Delta r$, which 
results in a phase difference $\Delta\varphi$ in the interferogram: 

$$ S = s_1 \cdot s_2^* = a_1 \cdot a_2 \cdot e^{j \Delta\varphi} \quad \text{with} \quad \Delta\varphi = \frac{2\pi}{\lambda} \, \Delta r , \qquad (1) $$

where $\lambda$ denotes the wavelength. The phase difference is unambiguous in the 
range $-\pi < \Delta\varphi < \pi$ only. Thus a phase-unwrapping step is often required 
before further processing. Furthermore, the range dependency of $\Delta\varphi$ has to be 
removed (flat earth correction). Afterwards, the relative height differences $\Delta h$ 
in the scene can be approximated from $\Delta\varphi$: 

$$ \Delta h \approx \frac{\lambda \, r \sin(\theta)}{2\pi B \cos(\xi - \theta)} \, \Delta\varphi , \qquad (2) $$

with parameters distance r, wavelength $\lambda$, antenna geometry angle $\xi$ and 
viewing angle $\theta$. The standard deviation of the interferogram phase depends 
approximately on the SNR and the number of independent looks L: 

$$ \sigma_{\Delta\varphi} \approx \frac{1}{\sqrt{SNR} \cdot \sqrt{L}} . \qquad (3) $$

With equations 2 and 3 the standard deviation of the elevation is obtained: 

$$ \sigma_h \approx \frac{\lambda \, r \sin(\theta)}{2\pi B \cos(\xi - \theta) \cdot \sqrt{SNR} \cdot \sqrt{L}} . \qquad (4) $$
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The coherence 7 is a function of the noise impact of the interferogram. It 
is usually estimated from the data by the magnitude of the complex cross- 
correlation coefficient of the SAR images: 



^ SNR 

n—1 

Hence, the local quality of an InSAR DEM can be directly assessed from the 
data by the related coherence value. 
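As an illustration of how such a coherence value might be estimated, the
following sketch computes the magnitude of the complex cross-correlation
coefficient for one estimation window. The window contents are made-up sample
values, and the function name is hypothetical.

```python
import math

def coherence(s1, s2):
    """Magnitude of the complex cross-correlation coefficient of two equally
    long lists of complex samples (one estimation window)."""
    cross = sum(a * b.conjugate() for a, b in zip(s1, s2))
    p1 = sum(abs(a) ** 2 for a in s1)
    p2 = sum(abs(b) ** 2 for b in s2)
    return abs(cross) / math.sqrt(p1 * p2)

window_1 = [1 + 1j, 2 - 1j, -1 + 0.5j, 0.5 + 2j]
# Same signal with a constant phase shift: fully coherent
window_2 = [z * (0.6 + 0.8j) for z in window_1]
gamma_same = coherence(window_1, window_2)
# Unrelated samples: coherence drops below one
gamma_diff = coherence(window_1, [1j, -2 + 0j, 1 + 1j, 3 - 1j])
```

A constant phase offset between the two images does not reduce the estimate,
which is why the coherence can serve as a local quality measure independent
of the elevation itself.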

3 Approach 

Knowledge about the typical appearance of terrain in the data may be
incorporated to discriminate urban from rural areas (Fig. 3). For the latter
task, classification schemes have been proposed in the literature which are
based on the statistical properties of SAR imagery. If such a preliminary
classification is carried out, the elevation data of natural areas are
smoothed with standard methods like averaging. The complementary areas are
processed with a phenomenology-based approach described below in more detail.





Fig. 3. Suitable restoration of elevation data after classification 



3.1 Modeling of Scene and Objects 

A section of the test dataset Frankfurt is depicted in Fig. 4. The modeling of
the scene is based on the following observations and assumptions: 

— Man-made objects often have right-angled or cylindrical structures and ap- 
pear as regions with similar intensity. Object boundaries in many cases co- 
incide with clearly visible intensity edges. 
— The elevation is modeled to be piecewise linear. Steps appear at building
walls only.

— Buildings may contain substructures with different height or roof material.
Their rooftops are assumed to be flat.

— At the building wall facing towards the sensor, layover and double-bounce
occur. This leads to areas of brighter intensity.

— At the other building wall an area on the ground is occluded (shadow area).
This shadow shows low intensity and poor coherence.

— Very flat objects, like roads, ponds or flat roofs without any structure at all,
behave very similarly to shadow areas, because the signal is totally reflected.




Fig. 4. Test data: intensity, elevation and coherence 



3.2 Segmentation and Reconstruction of Buildings 

The initial segmentation is carried out by a combined region growing in the 
intensity and the elevation data. Preprocessing is required to achieve a reason- 
able segmentation result: The intensity data is de-speckled and the elevation 
information is smoothed by median filtering. Usually, most of the segments are 
extracted from the intensity data. The segmentation in the elevation data is 
mainly required to distinguish rooftops from adjacent ground which sometimes 
appear very similar in the intensity channel. It is crucial to detect as many object 
boundaries as possible. Hence, the adaptive region growing threshold $th_{rg}$ can
be set to a small maximum value. As a consequence, over-segmentation occurs, 
which is corrected in a subsequent post-processing step described below. 

For each segment an average height is calculated independently. Considering 
the noise influence, the elevation values are weighted with the coherence in the 
averaging step. This results in a preliminary depth map with prismatic objects.
Segments with low average intensity or coherence are regarded as unreliable. 
These segments are assumed to coincide with shadow areas or roads, and are 
considered later to check the consistency of the results. 
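The coherence-weighted averaging per segment can be sketched as follows. The
flat per-pixel lists and the function name are assumptions made for this
example; the paper does not specify an implementation.

```python
def segment_heights(labels, elevation, coherence):
    """Coherence-weighted mean elevation per segment label.
    All three arguments are flat, equally long sequences (one entry per pixel)."""
    weighted_sum, weight = {}, {}
    for lab, h, c in zip(labels, elevation, coherence):
        weighted_sum[lab] = weighted_sum.get(lab, 0.0) + c * h
        weight[lab] = weight.get(lab, 0.0) + c
    # Segments without any coherent pixel get no height at all
    return {lab: weighted_sum[lab] / weight[lab]
            for lab in weight if weight[lab] > 0}

labels    = [0, 0, 0, 1, 1]
elevation = [10.0, 12.0, 30.0, 2.0, 4.0]
coh       = [0.9, 0.9, 0.1, 0.5, 0.5]
heights = segment_heights(labels, elevation, coh)
```

In this toy example the outlier of 30 m in segment 0 has low coherence and
therefore barely shifts the segment height, whereas a plain mean would be
pulled to about 17.3 m.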

Two types of building parts are discriminated. Major building parts have
a significantly larger average height than at least one adjacent segment. They
are found at boundaries between buildings and ground. Minor building parts
are surrounded by major building parts and correspond to superstructures.
Adjacent building segments are joined and form a set of more complex
structures. Furthermore, the segments in question have to match the building
model with respect to area, shape and compactness.

Shadow cast from a building leads to either stripe or L-shaped segments, 
depending on the aspect. Their width is a function of viewing angle and object 
elevation. Hence, for each building candidate an expectation area for a shadow 
stripe is predictable. Unfortunately, shadow cannot always be distinguished from 
objects which appear similar in the data, like roads. Therefore, as a minimum 
requirement, an area of the set of unreliable segments is expected to be found 
at the predicted shadow location. If so, the candidate segment is labeled to be 
a building. In case shadow does not interfere with roads, a more subtle analysis 
is carried out. Shadow stripes are extracted in the intensity data with a sim- 
ple structural image analysis algorithm. Starting with short lines as primitives, 
stripes are built up which enclose regions of low intensity.

Shadow areas are used as well to overcome under-segmentation. Segments
which contain a possible shadow area are further investigated in a
post-processing step. Under-segmentation is corrected in two different ways.
If the histogram of the original elevation values shows a bimodal curve, the
segment is split in two closed segments, if possible. In a second approach a
region growing step in the median filtered elevation data is performed. In
contrast to the initial segmentation, the border towards the shadow region is
used as seed and the threshold is smaller. Over-segmentation is corrected by
merging adjacent segments with similar heights. After post-processing, the
depth map is recalculated in the manner described above.



4 Results 

The test data (Fig. 4) was acquired with the airborne AER-II sensor of FGAN.
Range direction is from left to right, ground resolution is approximately
one meter. From the sensor configuration the standard deviation of the
elevation measurement is estimated to be about two meters in the best case.
Several extended buildings are present in the scene at Frankfurt airport. The
rooftops are generally flat with small elevated superstructures, mostly
required for illumination and air-conditioning inside the building (Fig. 5a).

Thresholds are related to a maximum value of 255, except coherence which
is defined to be in the range [0, 1]. The threshold $th_{rg}$ for the initial
region growing was set to 10. In Fig. 5b possible layover segments (high
intensity) and unreliable segments (low average intensity or coherence,
$th_{int} = 70$ and $th_{coh} = 0.75$) are shown. The layover candidates
coincide as expected with building walls towards the sensor. The
superstructures on the rooftops show bright intensity as well. However,
closed layover candidate segments could not be detected at every suitable
location. The area considered as unreliable covers the roads and is found at
the building walls facing away from the sensor. It turned out that especially
shadow cast from objects with small height difference to the ground could not
be segmented from the average intensity data alone. This behavior is probably
caused by multi-bounce effects. The extracted shadow stripes (Fig. 5c)
correspond well with the unreliable segments.




Fig. 5. a) Aerial image, b) Segmentation: layover (bright) and unreliable areas (dark), 
c) shadow stripes 



In Fig. 6a the extracted major and minor building segments are depicted.
The depth map in Fig. 6b shows all details. Especially at layover areas
over-segmentation occurs. In Fig. 6c the minor objects are neglected to give a more
general overview of the scene. This result is superimposed with building foot- 
prints extracted from the aerial image. The missing part of the large building 
was rejected because of poor coherence. One building was not detected, probably 
caused by interference with the large building nearby. This behavior is subject 
to further studies. The forest in the lower left showed similar properties and 
could not be distinguished from the buildings. The remaining buildings match 
well with the reference data. Note the building in the middle left, missing in the 
aerial image. It was erected in the period between the acquisition of the aerial 
image and the InSAR measurement campaign. The accuracy of the averaged 
elevation data was in the order of the standard deviation of the measurement. 

5 Conclusion and Future Work 

A model-based approach for the segmentation of buildings in InSAR data was 
proposed. The results are useful for an improved interpretation of the scene. In 
case information about vast areas has to be acquired very fast, for example in a 
disaster scenario like an earthquake, the approach may be employed for change 
detection. However, the method is limited to extended, detached and flat roofed 
buildings. In future work a subsequent reconstruction step will be carried out, 
which yields a generalized vector description of the buildings. Additionally, the 
object model will be expanded to sloped objects like gable roofs. 
Fig. 6. Results: Building segments, detailed depth map, depth map with reference data

Abstract. We consider approaches to computer vision problems which
require the minimization of a global energy functional over binary vari- 
ables and take into account both local similarity and spatial context. The 
combinatorial nature of such problems has led to the design of various
approximation algorithms in the past which often involve tuning param- 
eters and tend to get trapped in local minima. 

In this context, we present a novel approach to the field of computer vi- 
sion that amounts to solving a convex relaxation of the original problem 
without introducing any additional parameters. Numerical ground truth 
experiments reveal a relative error of the convex minimizer with respect 
to the global optimum of below 2% on the average. 

We apply our approach by discussing two specific problem instances re- 
lated to image partitioning and perceptual grouping. Numerical exper- 
iments illustrate the quality of the approach which, in the partition- 
ing case, compares favorably with established approaches like the
ICM-algorithm.



1 Introduction 

The minimization of energy functionals plays a central role in many computer 
vision problems. When additionally discrete decision variables are involved, this 
task becomes intrinsically combinatorial and hence not easy to tackle. This fact 
motivated the development of various optimization approaches, like simulated
annealing for binary image restoration, the ICM-algorithm for Markov Random
Field based estimates [2], deterministic annealing for perceptual grouping,
and many more.

However, two crucial requirements from the optimization point of view
continue to challenge these results: the quality of suboptimal solutions that
can be achieved and the presence of additional parameters that have to be
tuned by the user. Concerning quality, only simulated annealing can guarantee
to find the optimal solution, at the cost of impractically slow annealing
schedules. Other approaches are not immune against being trapped in a
possibly bad local minimum. On the other hand, additional algorithmic tuning
parameters for controlling search heuristics, annealing schedules, etc.,
often critically influence the quality of a solution despite having nothing
to do in the first place with the original problem to be solved.

One possibility to overcome this latter point is to consider the mathemati- 
cally well-understood class of convex optimization problems: They only have one 
global minimum which can be determined by established numerical algorithms 
without using additional parameters. In order to tackle a highly non-convex com- 
binatorial optimization problem in this way, it is relaxed by mapping it into a 
higher dimensional space where the feasible solutions are contained in a convex 
set. Minimization in that space and mapping back yields a suboptimal
solution of the original problem, which usually is of a good quality. For
special combinatorial problems it is even possible to give an optimality
bound: Goemans and Williamson proved the remarkable result that suboptimal
solutions computed by convex optimization for the well known max-cut problem
are at most 14% worse than the global optimum.

These favorable properties of convex optimization approaches have motivated 
our work. In the following sections we show how problems from the general class 
of binary quadratic functionals may be relaxed and solved by convex optimiza- 
tion. We next apply our approach to two specific computer vision problems 
(binary image partitioning and perceptual grouping) that fit into this class and 
present numerical results. Additionally, we show the quality of our approach by 
checking it using one-dimensional signals for which the optimal solution (ground 
truth) can be easily computed, and by comparing the image partitioning results 
with those of the ICM-algorithm. 

2 Problem Statement: Minimizing Binary Quadratic 
Functionals 

In this paper, we consider the problem of minimizing binary quadratic function- 
als which have the following general form: 

$$ J(x) = x^T Q x + 2 b^T x + \text{const} , \qquad x \in \{-1, 1\}^n , \ Q \in S^n , \ b \in \mathbb{R}^n . \tag{1} $$

As no constraints are imposed on the matrix Q (apart from being in the class
of symmetric matrices $S^n$), such a functional is generally not convex.
Furthermore, the integer constraint $x_i \in \{-1, 1\}$, $i = 1, \dots, n$,
makes the minimization problem (1) intrinsically difficult.

Computer vision problems with quadratic functionals like (1) arise in various
contexts; in this paper, we will consider two representatives that are briefly
illustrated in the following sections.
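To make the combinatorial nature of the problem concrete, the following sketch
evaluates a binary quadratic functional and minimizes it by exhaustive
enumeration. The matrix Q, the vector b and the function names are invented
for this illustration; enumeration is only feasible for very small n because
there are 2^n candidate vectors.

```python
from itertools import product

def energy(x, Q, b):
    """J(x) = x^T Q x + 2 b^T x (the constant term is dropped)."""
    n = len(x)
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    lin = 2 * sum(b[i] * x[i] for i in range(n))
    return quad + lin

def brute_force_minimum(Q, b):
    """Enumerate all 2^n sign vectors and return the best one."""
    return min(product((-1, 1), repeat=len(b)), key=lambda x: energy(x, Q, b))

Q = [[0, -1, 0], [-1, 0, -1], [0, -1, 0]]   # rewards equal neighbouring signs
b = [-1, 0, 0]                              # pulls the first variable towards +1
x_best = brute_force_minimum(Q, b)
```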

2.1 Binary Image Partitioning 

Suppose that for each pixel position i of an image, a feature value
$g_i \in \mathbb{R}$ has been measured that originates from either of two
known prototypical values $u_1, u_2$.^1

^1 Besides the case of gray-values considered in this paper, local features
related to motion, texture, etc., could also be dealt with.

Fig. 1. A binary image, heavily corrupted by (real-valued) noise.



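Figure 1 shows an example, where g gives the gray-values of the noisy image
that originally was black and white.

To restore the discrete-valued image function x that represents the original
image, we wish to minimize the following functional which has the form of (1):

$$ J(x) = \frac{1}{4} \sum_i \big( (u_2 - u_1) x_i + u_1 + u_2 - 2 g_i \big)^2 + \frac{\lambda}{2} \sum_{\langle i,j \rangle} (x_i - x_j)^2 \tag{2} $$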

Here, the second term sums over all pairwise adjacent variables in vertical and 
horizontal direction on the regular image grid. 

Functional (2) comprises two terms familiar from many regularization
approaches: a data-fitting term and a smoothness term modeling spatial
context. However, due to the integer constraint $x_i \in \{-1, 1\}$, the
optimization problem considered here is much more difficult than standard
regularization problems.

In comparison to the ICM-algorithm [2], which minimizes a similar objective
function, there are two main differences: Whereas the functional (2) sums
over all pixels and hence is a global approach, the ICM-algorithm optimizes
the objective function locally for each pixel in turn, which results in a
much smaller complexity. The second difference is that the ICM-algorithm also
uses the diagonally adjacent variables in the smoothness term for each pixel.
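The local, pixel-by-pixel character of an ICM-style update can be sketched as
follows. This is a simplified illustration, not the algorithm used in the
experiments: it assumes a squared-distance data term, a smoothness weight
`lam` and a 4-neighbourhood without the diagonal neighbours mentioned above.

```python
def icm(g, u1, u2, lam, sweeps=10):
    """Iterated conditional modes on a 2D grid given as a list of rows.
    Labels are -1 (prototype u1) and +1 (prototype u2)."""
    rows, cols = len(g), len(g[0])
    proto = {-1: u1, 1: u2}
    # Initialise every pixel with its nearest prototype (data term only)
    x = [[-1 if abs(v - u1) <= abs(v - u2) else 1 for v in row] for row in g]

    def local_cost(i, j, label):
        cost = (g[i][j] - proto[label]) ** 2
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                cost += 0.5 * lam * (label - x[ni][nj]) ** 2
        return cost

    for _ in range(sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                best = min((-1, 1), key=lambda lab: local_cost(i, j, lab))
                if best != x[i][j]:
                    x[i][j], changed = best, True
        if not changed:          # a full sweep without change: local minimum
            break
    return x

noisy = [[0.1, 0.2, 0.0],
         [0.1, 0.9, 0.2],        # isolated outlier in the centre
         [0.0, 0.1, 0.1]]
restored = icm(noisy, u1=0.0, u2=1.0, lam=0.5)
```

Each update only looks at one pixel and its neighbours, which makes a sweep
cheap but also explains why the result depends on the initialisation and may
remain in a local minimum.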



2.2 Perceptual Grouping 

The task in perceptual grouping is to separate familiar configurations from the
(unknown) background. Figure 2 shows an example: The instances of a given
object consisting of three edge-elements are to be found in a noisy image.

To describe this problem mathematically, let $g_i$, $i = 1, \dots, n$, denote
a feature primitive (e.g. an edgel computed at location i) in the image
plane. We suppose that for each pair of primitives $g_i, g_j$ a similarity
measure $d_{ij}$ can be computed corresponding to some given specific object
properties (e.g. the difference between the relative angle of the edgels and
the expected one according to the given object). Using the spatial context
modeled by the values $d_{ij}$, each primitive $g_i$ is labeled with a
decision variable $x_i \in \{-1, 1\}$ (“1” corresponding to figure, “-1”
corresponding to background and noise) by minimizing a functional of the
form (1):

$$ J(x) = \sum_{\langle i,j \rangle} (\lambda - d_{ij}) \, x_i x_j + 2 \sum_i \Big( \sum_j (\lambda - d_{ij}) \Big) x_i , \tag{3} $$

where the first term sums over all pairs of feature primitives.

Fig. 2. Left: An object. Right: The object was rotated by a fixed arbitrary
angle, and translated and scaled copies have been superimposed by noise.



3 Convex Problem Relaxation and Algorithm 



To overcome the hard integer constraints, problem (1) is relaxed to arrive at
a less restricted, convex optimization problem of real variables, which can be
tackled by using known methods like interior point algorithms that lead to an
approximate solution for the original problem.

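By dropping the constant and homogenizing the objective function, we
rewrite the original functional (1) as follows:

$$ x^T Q x + 2 b^T x = \begin{pmatrix} x \\ 1 \end{pmatrix}^T L \begin{pmatrix} x \\ 1 \end{pmatrix} , \qquad L = \begin{pmatrix} Q & b \\ b^T & 0 \end{pmatrix} . \tag{4} $$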



With slight abuse of notation, we denote the vector $(x^T \ 1)^T$ again by x.
Then the minimization problem can be rewritten:

$$ \inf_{x \in \{-1,1\}^{n+1}} x^T L x = \inf_{x \in \{-1,1\}^{n+1}} L \bullet x x^T , $$

where $X \bullet Y = \operatorname{trace}(XY)$ denotes the standard matrix
inner product for two symmetric matrices X, Y.
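The homogenization step can be checked numerically. The sketch below assumes
that L is the bordered matrix built from Q and b with a zero in the lower
right corner, and uses small made-up values; all function names are
hypothetical.

```python
def homogenize(Q, b):
    """Build L = [[Q, b], [b^T, 0]] as a list of rows."""
    return [row + [bi] for row, bi in zip(Q, b)] + [list(b) + [0]]

def quad_form(M, v):
    """v^T M v"""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

def inner(M, N):
    """Matrix inner product trace(M N) for symmetric matrices."""
    n = len(M)
    return sum(M[i][j] * N[j][i] for i in range(n) for j in range(n))

Q = [[1, 2], [2, -3]]
b = [4, -1]
x = [1, -1]

lhs = quad_form(Q, x) + 2 * sum(bi * xi for bi, xi in zip(b, x))
xh = x + [1]                                   # homogenized vector (x, 1)
L = homogenize(Q, b)
outer = [[p * q for q in xh] for p in xh]      # rank-one matrix xh xh^T
```

All three expressions, `lhs`, `quad_form(L, xh)` and `inner(L, outer)`, yield
the same value, which is what allows the objective to be written as a linear
function of the matrix `outer`.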

The convex relaxation is now achieved by replacing the rank-one matrix
$x x^T \in \mathcal{K}$ by an arbitrary matrix $X \in \mathcal{K}$, where
$\mathcal{K}$ denotes the cone of symmetric, positive semidefinite
$(n+1) \times (n+1)$-matrices. Additionally, the integer constraint
$x_i \in \{-1, 1\}$ is relaxed to the corresponding constraint $X_{ii} = 1$,
so that we get:

$$ \inf_{X \in \mathcal{K}} L \bullet X , \qquad X_{ii} = 1, \ \forall i . \tag{5} $$

The optimization problem (5) belongs to the class of conic programs, which
can be solved approximately with iterative interior-point algorithms. To
finally get back the solution x to (1) from the computed solution X to (5),
we used the randomized-hyperplane technique of Goemans and Williamson.
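The rounding step alone can be sketched without a conic solver. The code
below assumes that a positive semidefinite matrix X with unit diagonal is
already available, factorizes it, and cuts the factor vectors with random
hyperplanes; the number of trials, the seed and the function names are
arbitrary choices for this illustration.

```python
import math
import random

def cholesky_psd(X, eps=1e-12):
    """Lower-triangular V with V V^T = X for a symmetric PSD matrix X."""
    n = len(X)
    V = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = X[j][j] - sum(V[j][k] ** 2 for k in range(j))
        V[j][j] = math.sqrt(d) if d > eps else 0.0
        for i in range(j + 1, n):
            if V[j][j] > 0.0:
                s = sum(V[i][k] * V[j][k] for k in range(j))
                V[i][j] = (X[i][j] - s) / V[j][j]
    return V

def round_hyperplane(X, L, trials=50, seed=0):
    """Return the best sign vector found over several random hyperplanes."""
    rng = random.Random(seed)
    n = len(X)
    V = cholesky_psd(X)
    best, best_val = None, float("inf")
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x = [1 if sum(V[i][k] * r[k] for k in range(n)) >= 0 else -1
             for i in range(n)]
        if x[-1] < 0:                 # the homogenizing coordinate must be +1
            x = [-xi for xi in x]
        val = sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
        if val < best_val:
            best, best_val = x, val
    return best, best_val

# A rank-one X = x x^T for x = (1, -1, 1) must be rounded back to x itself
X = [[1, -1, 1], [-1, 1, -1], [1, -1, 1]]
L = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
x_rounded, value = round_hyperplane(X, L)
```

When the relaxed solution already has rank one, every hyperplane recovers the
same sign vector; for higher rank, repeating the draw and keeping the best
objective value is a common practical choice.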

4 Numerical Experiments 

4.1 Restoration of 1D Noisy Signals Comprising Multiple Scales

To investigate the performance of the convex relaxation approach (5), we
chose a one-dimensional synthetic signal x' (see Figure 3) that was
superimposed with Gaussian white noise with standard deviation
$\sigma = 1.0$, and tried to restore the original signal by using the
one-dimensional version of functional (2). Since it is possible to compute
the global optimum $x^*$ of (2) for one-dimensional signals by using dynamic
programming, we were able to compare the resulting suboptimal solution x to
the optimal solution $x^*$. This experiment was repeated 1000 times for
different noisy signals and varying values of $\lambda$. The results revealed
an average relative error of below 2% for both the objective function value
and the number of correctly classified pixels (see Figure 4). A
representative example of the restoration is given in Figure 5.
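How a global optimum can be obtained for a one-dimensional signal is sketched
below with a Viterbi-style dynamic program. The energy used here, a squared
distance to the prototypes plus `lam/2` times the squared label difference of
neighbours, is an assumed form chosen for illustration, and the short signal
is made up.

```python
from itertools import product

def energy_1d(x, g, u1, u2, lam):
    proto = {-1: u1, 1: u2}
    data = sum((gi - proto[xi]) ** 2 for gi, xi in zip(g, x))
    smooth = sum((a - b) ** 2 for a, b in zip(x, x[1:]))
    return data + 0.5 * lam * smooth

def dp_minimum(g, u1, u2, lam):
    """Global minimizer on a chain by dynamic programming."""
    labels = (-1, 1)
    proto = {-1: u1, 1: u2}
    cost = {s: (g[0] - proto[s]) ** 2 for s in labels}
    back = []
    for gi in g[1:]:
        new_cost, pointers = {}, {}
        for s in labels:
            # best predecessor label for the current label s
            prev = min(labels, key=lambda t: cost[t] + 0.5 * lam * (s - t) ** 2)
            new_cost[s] = (cost[prev] + 0.5 * lam * (s - prev) ** 2
                           + (gi - proto[s]) ** 2)
            pointers[s] = prev
        cost = new_cost
        back.append(pointers)
    s = min(labels, key=lambda t: cost[t])
    x = [s]
    for pointers in reversed(back):   # trace the optimal path backwards
        s = pointers[s]
        x.append(s)
    return x[::-1]

g = [0.1, 0.9, 0.2, 0.8, 1.1, 0.9, 0.7]
x_dp = dp_minimum(g, u1=0.0, u2=1.0, lam=0.3)
x_bf = min(product((-1, 1), repeat=len(g)),
           key=lambda x: energy_1d(x, g, 0.0, 1.0, 0.3))
```

The dynamic program needs a number of operations linear in the signal length,
whereas the exhaustive check used here for comparison grows as 2^n and is
only included to confirm the result on a tiny example.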



4.2 2D-Images and Grouping 

The results of applying the algorithm to restoration problems in the context
of two-dimensional images are shown in Figures 6 to 8. The quality of the
restorations using convex optimization is encouraging. Small errors that
occur at the corners of connected regions (see the (b) panels) are due to the
fact that the homogeneity constraint is less restrictive there as the local
structure resembles noise.

We also implemented the locally classifying ICM-algorithm (see [2]), and
applied it for different values of the parameter $\kappa\beta$, which plays a
similar role as the parameter $\lambda$ in (2). The comparison reveals the
power of our global approach: Whereas the ICM-algorithm leaves some local
holes within the structure (see Figures 6(d), 7(c), 8(c)), convex
optimization is able to close these holes and yield a better piecewise-smooth
result.

Concerning the grouping problem, the results of convex optimization are
equally good (see Figure 9): Besides a small number of misclassified edgels
that could not be identified as background by the chosen similarity measure,
the original structure is clearly visible.
Fig. 3. Signal x' comprising multiple spatial scales.

Fig. 4. Average relative error of the objective function J and average
percentage of misclassified pixels of the suboptimal solution x compared to
the optimal solution $x^*$ for different values of the scale parameter
$\lambda$.

Fig. 5. A representative example illustrating the statistics shown in Fig. 4.
Top: Noisy input signal. Middle: Optimal solution $x^*$. Bottom: Suboptimal
solution x.

Fig. 6. Arrow and bar real image. (a) Noisy original. (b), (c) Suboptimal
solutions computed with convex optimization for $\lambda = 0.8, 3.0$. (d)
Solution computed with the ICM-algorithm for $\kappa\beta = 0.99$.

Fig. 7. Iceland image. (a) Binary noisy original. (b) Suboptimal solution
computed with convex optimization for $\lambda = 2.0$. (c) Solution computed
with the ICM-algorithm for $\kappa\beta = 0.9$. (d) Original before adding
noise.

Fig. 8. Checkerboard image. (a) Noisy original. (b) Suboptimal solution
computed with convex optimization for $\lambda = 1.5$. (c) Solution computed
with the ICM-algorithm for $\kappa\beta = 0.4$.

Fig. 9. Grouping. (a) Input data (see Section 2.2). (b) Suboptimal solution
computed with convex optimization for $\lambda = 0.9$. (c) The true solution.

5 Conclusion 

The quality of the results of the numerical experiments confirms the wide
range of applicability of the convex optimization approach. Many other
problems with objective functions from the large class of binary quadratic
functionals could be tackled. This fact, together with the useful
mathematical property that no additional tuning parameters are necessary,
encourages us to extend our work on the convex optimization approach.
Abstract. We present a novel approach to the weighted graph-matching
problem in computer vision, based on a convex relaxation of the underly- 
ing combinatorial optimization problem. The approach always computes 
a lower bound of the objective function, which is a favorable property in 
the context of exact search algorithms. Furthermore, no tuning parame- 
ters have to be selected by the user, due to the convexity of the relaxed 
problem formulation. 

For comparison, we implemented a recently published deterministic an- 
nealing approach and conducted numerous experiments for both estab- 
lished benchmark experiments from combinatorial mathematics, and for 
random ground-truth experiments using computer-generated graphs. Our 
results show similar performance for both approaches. In contrast to the 
convex approach, however, four parameters have to be determined by 
hand for the annealing algorithm to become competitive. 



1 Introduction 

Motivation. Visual object recognition is a central problem of computer vision 
research. A key question in this context is how to represent objects for the purpose 
of recognition by a computer vision system. Approaches range from view-based to 
3D model-based, from object-centered to viewer-centered representations [1], each 
of which may have advantages under constraints related to specific applications. 
Psychophysical findings provide evidence for view-based object representations 
[2] in human vision. 

A common and powerful representation format for object views is a set of 
local image features V along with pairwise relations E (spatial proximity and
(dis)similarity measure), that is, an undirected graph G = (V,E). In this paper,
we will discuss the application of a novel convex optimization technique to the 
problem of matching relational representations of object views. 

Relations to previous work. There are numerous approaches to graph-matching 
in the literature (e.g., [3-8]). Our work differs from them with respect to the fol- 
lowing points: 
1. We focus on problem relaxations, i.e. the optimization criterion equals the
original one but is subject to weaker constraints. As a consequence, such 
approaches compute a lower bound of the original objective function which 
has to be minimized. 

2. The global optimum of the relaxed problem can be computed with polynomial- 
time complexity. 




Fig. 1. A graph based on features described in [9] using corresponding public
software (http://www.ipb.uni-bonn.de/ipb/projects/fex/fex.html). This graph
has |V| = 38 nodes.



The first property above is necessary for combining the approach with an
exact search algorithm where lower bounds of the original objective function
are needed. Furthermore, it allows different approaches to be compared by
simply ranking the corresponding lower bounds.

The second property is important since graph-matching belongs to the class of
NP-hard combinatorial problems. Matching two graphs with, say, |V| = 20 nodes
gives approximately $10^{18}$ possible matches. Typical problem instances,
however (see Fig. 1), comprise |V| > 20 nodes and thus motivate the search
for tight problem relaxations to compute good suboptimal solutions in
polynomial time.
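The size of the search space follows directly from the number of permutations
of n nodes, which is n!. A short check for the two graph sizes mentioned in
the text (the variable names are chosen for this example only):

```python
import math

matches_20 = math.factorial(20)   # number of permutations of 20 nodes
matches_38 = math.factorial(38)   # graph size of the example in Fig. 1

# Order of magnitude, i.e. the exponent in scientific notation
order_20 = len(str(matches_20)) - 1
order_38 = len(str(matches_38)) - 1
```

For 20 nodes this gives about 2.4 times 10^18 candidate matches, and for the
38-node example the count already exceeds 10^44, which rules out exhaustive
search.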

Contribution. We discuss the application of novel convex optimization tech- 
niques to the graph-matching problem in computer vision. 

First, we sketch a recently published deterministic annealing approach [6, 10] 
which stimulated considerable interest in the literature due to its excellent per- 
formance in numerical experiments. Unfortunately, this approach cannot be inter- 
preted as a relaxation of the graph-matching problem and requires the selection 
of (at least) four parameters to obtain optimal performance (Section 3). 

Next we consider the relaxed problems proposed in [11,3] and show that, by us- 
ing convex optimization techniques based on the work [12], a relaxation of the 
graph-matching problem is obtained with equal or better performance than the 
other approaches (Section 4). Moreover, due to convexity, no parameter selection 
is required. 

In Section 5, we report extensive numerical results with respect to both benchmark- 
experiments [13] from the field of combinatorial mathematics, and random ground- 
truth experiments using computer-generated graphs. 

Remark. Note that, in this paper, we are exclusively concerned with the
optimization procedure of the graph-matching problem. For issues related to
the design of the optimization criterion we refer to, e.g., [4,14].
Notation.

$X^T$: transpose of a matrix X
$\mathcal{O}$: set of orthogonal $n \times n$-matrices X, i.e. $X^T X = I$ (I: unit matrix)
$\mathcal{E}$: matrices with unit row and column sums
$\mathcal{N}$: set of non-negative matrices
$\Pi$: set of permutation matrices $X \in \mathcal{O} \cap \mathcal{E} \cap \mathcal{N}$
e: one-vector $e_i = 1$, $i = 1, \dots, n$
vec[A]: vector obtained by stacking the columns of some matrix A
$\lambda(A)$: vector of eigenvalues of some matrix A



2 Problem statement 

Let G = (V,E), G' = (V',E') denote two weighted undirected graphs with
|V| = |V'| = n, weights $\{w_{ij}\}, \{w'_{ij}\}$, and adjacency matrices
$A_{ij} = w_{ij}$, $A'_{ij} = w'_{ij}$, $i,j = 1, \dots, n$. Furthermore, let
$\phi$ denote a permutation of the set $\{1, \dots, n\}$ and $X \in \Pi$ the
corresponding permutation matrix, that is $X_{ij} = 1$ if $\phi(i) = j$ and
$X_{ij} = 0$ otherwise. The weight functions
$w, w' : E \subseteq V \times V \to \mathbb{R}$ encode (dis)similarity
measures of local image features $V_i$, $i = 1, \dots, n$, which we assume to
be given in this paper. We are interested in matching graphs G and G' by
choosing a permutation $\phi^*$ such that

$$ \phi^* = \arg\min_{\phi} \sum_{i,j} \big( w_{\phi(i)\phi(j)} - w'_{ij} \big)^2 . $$

By expanding and dropping constant terms, we obtain the equivalent problem:

$$ \min_{\phi} \Big( -\sum_{i,j} w_{\phi(i)\phi(j)} \, w'_{ij} \Big) = \min_{X \in \Pi} \big( -\operatorname{tr}(A' X A X^T) \big) , $$

with tr(·) denoting the trace of a matrix. Absorbing the minus sign, we arrive
at the following Quadratic Assignment Problem (QAP) with some arbitrary,
symmetric matrices A, B:

$$ \text{(QAP)} \qquad \min_{X \in \Pi} \operatorname{tr}(A X B X^T) . \tag{1} $$
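The equivalence between the squared-difference criterion and the trace
formulation can be verified on a tiny example by enumerating all
permutations. The two weight matrices and all function names below are made
up for illustration; indices are 0-based.

```python
from itertools import permutations

def perm_matrix(phi):
    """X_ij = 1 if phi(i) = j, else 0."""
    n = len(phi)
    return [[1 if phi[i] == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def squared_diff(phi, A, A2):
    n = len(phi)
    return sum((A[phi[i]][phi[j]] - A2[i][j]) ** 2
               for i in range(n) for j in range(n))

def trace_term(phi, A, A2):
    X = perm_matrix(phi)
    return trace(matmul(matmul(A2, X), matmul(A, transpose(X))))

A  = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
A2 = [[0, 2, 1], [2, 0, 4], [1, 4, 0]]     # A with relabelled nodes
perms = list(permutations(range(3)))
best_sq = min(perms, key=lambda p: squared_diff(p, A, A2))
best_tr = max(perms, key=lambda p: trace_term(p, A, A2))
```

Both criteria select the same permutation because the squared difference
equals a constant minus twice the trace term; the enumeration itself is of
course only possible for very small graphs.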

3 Graduated assignment 

Gold and Rangarajan [6] and Ishii and Sato [10] independently developed a
technique commonly referred to as graduated assignment or soft assign
algorithm. The set of permutation matrices $\Pi$ is replaced by the convex
set $\mathcal{D} = \mathcal{E} \cap \mathcal{N}$ of positive matrices with
unit row and column sums (doubly stochastic matrices). In contrast to
previous mean-field annealing approaches, the graduated assignment algorithm
enforces hard constraints on row and column sums, making it usually superior
to other deterministic annealing approaches. The core of the algorithm is the
following iteration scheme, where $\beta > 0$ denotes the annealing parameter
and the superscript denotes the iteration time step (for $\beta$ fixed):

$$ X_{ij}^{(r+1)} = g_i h_j \, y_{ij}^{(r)} , \quad \text{with} \quad y_{ij}^{(r)} = \exp\Big( -\beta \sum_{k,l} A_{ik} B_{jl} X_{kl}^{(r)} \Big) . \tag{2} $$

The scaling coefficients $g_i, h_j$ are computed so that $X^{(r+1)}$ is
projected on the set $\mathcal{D}$ using Sinkhorn’s algorithm [6] as inner
loop.
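The inner loop can be illustrated by a plain Sinkhorn normalization, which
alternately rescales rows and columns of a positive matrix. The outer update
shown here is only an assumed form of the exponential step (without the
self-amplification term), and the matrices, the value of `beta` and the fixed
iteration count are arbitrary example choices.

```python
import math

def sinkhorn(Y, iterations=200):
    """Alternately normalise rows and columns of a positive matrix Y."""
    n = len(Y)
    X = [row[:] for row in Y]
    for _ in range(iterations):
        X = [[v / sum(row) for v in row] for row in X]                # rows
        col = [sum(X[i][j] for i in range(n)) for j in range(n)]
        X = [[X[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns
    return X

def softassign_step(X, A, B, beta):
    """One outer update: exponentiate the scaled matrix A X B, then project."""
    n = len(X)
    AX = [[sum(A[i][k] * X[k][l] for k in range(n)) for l in range(n)]
          for i in range(n)]
    AXB = [[sum(AX[i][l] * B[l][j] for l in range(n)) for j in range(n)]
           for i in range(n)]
    Y = [[math.exp(-beta * AXB[i][j]) for j in range(n)] for i in range(n)]
    return sinkhorn(Y)

A = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
B = [[0, 1, 5], [1, 0, 2], [5, 2, 0]]
X0 = [[1.0 / 3.0] * 3 for _ in range(3)]      # uniform starting point
X1 = softassign_step(X0, A, B, beta=0.5)
```

In practice the fixed iteration count would be replaced by a stopping
criterion, which is one of the tuning parameters discussed in this section.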

This scheme locally converges under mild assumptions [15]. Several studies
revealed excellent experimental results. In our experiments, we improved the
obtained results with a local 2opt heuristic which iteratively improves the
objective function by exchanging two rows and columns of the found
permutation matrix until no improvement in the objective function is
possible, as proposed in [10].
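A pairwise-exchange local search of this kind can be sketched as follows. The
permutation is stored as a list, so that exchanging two entries corresponds
to exchanging two rows of the permutation matrix; the matrices and function
names are invented for the example, and the acceptance rule (take every
improving swap) is one possible variant.

```python
def qap_cost(phi, A, B):
    """tr(A X B X^T) for the permutation phi with X_ij = 1 iff phi(i) = j."""
    n = len(phi)
    return sum(A[i][j] * B[phi[j]][phi[i]] for i in range(n) for j in range(n))

def two_opt(phi, A, B):
    """Swap pairs of assignments as long as the objective decreases."""
    phi = list(phi)
    best = qap_cost(phi, A, B)
    improved = True
    while improved:
        improved = False
        for i in range(len(phi)):
            for j in range(i + 1, len(phi)):
                phi[i], phi[j] = phi[j], phi[i]
                cost = qap_cost(phi, A, B)
                if cost < best:
                    best, improved = cost, True
                else:
                    phi[i], phi[j] = phi[j], phi[i]   # undo the swap
    return phi, best

A = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
B = [[0, 1, 5], [1, 0, 2], [5, 2, 0]]
start = [0, 1, 2]
phi_2opt, cost_2opt = two_opt(start, A, B)
```

The result is a local optimum with respect to pairwise exchanges; it is not
guaranteed to be the global optimum of the assignment problem.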

A drawback of this approach is that the selection of several tuning parameters
is necessary to obtain optimal performance, namely:

— the parameter $\beta$ related to the annealing schedule,

— a “self-amplification” parameter enforcing integer values, and

— two stopping criteria with respect to the two iteration loops in (2).

Furthermore, the optimal parameter values vary for different problem instances 
(cf. [10]). For more details, we refer to [16]. 

4 Convex Approximations 

In this section, we discuss a convex approximation to the weighted graph-matching 
problem (1). For more details and proofs, we refer to [16]. 

As explained in Section 1, our motivation is twofold: Firstly, the need to se- 
lect parameter values (cf. previous section) is quite inconvenient when using a 
graph-matching approach as a part within a computer vision system. Convex 
optimization problems admit algorithmic solutions without any further param- 
eters. Secondly, we focus on problem relaxations providing lower bounds of the 
objective criterion (1), which then can be used in the context of exact search 
algorithms. 

4.1 Orthogonal relaxation and eigenvalue bounds

Replacing the set $\Pi$ by $\mathcal{O} \supset \Pi$, Finke et al. [11]
proved the following so-called Eigenvalue Bound (EVB) as a lower bound of (1):

$$ \text{(EVB)} \qquad \min_{X \in \mathcal{O}} \operatorname{tr}(A X B X^T) = \lambda(A)^T \lambda(B) , \tag{3} $$

with $\lambda(A), \lambda(B)$ sorted such that
$\lambda_1(A) \ge \dots \ge \lambda_n(A)$ and
$\lambda_1(B) \le \dots \le \lambda_n(B)$.
This bound can be improved to give the Projected Eigenvalue Bound (PEVB)
by further constraining the set of admissible matrices [17], but in contrast to the
approach sketched in Section 4.3 this does not produce a matrix X for which the
bound (PEVB) is attained.

4.2 The approach by Umeyama

Based on (3), Umeyama [3] proposed the following estimate for the solution of
(1):

$$ X_{ume} = \arg\max_{X \in \Pi} \operatorname{tr}(X^T |U| \, |V|^T) . \tag{4} $$

Here, U and V diagonalize the adjacency matrices A and B, respectively, with the
eigenvalues sorted according to (EVB), and | · | denotes the matrix consisting of the
absolute value taken for each element. (4) is a linear assignment problem which
can be efficiently solved by using standard methods like linear programming.

4.3 Convex relaxation 

Anstreicher and Brixius [12] improved the projected eigenvalue bound (PEVB) 
introduced in Section 4.1 to the Quadratic Programming Bound (QPB): 



(QPB)   \lambda(\tilde{A})^\top \lambda(\tilde{B}) + \min_{X} \operatorname{vec}[X]^\top Q \operatorname{vec}[X] ,   (5)

where $\tilde{A} = P^\top A P$, $\tilde{B} = P^\top B P$, with P being the orthogonal projection onto the
complement of the 1D-subspace spanned by the vector e, and where the matrix
Q is computed as solution to the Lagrangian dual problem of the minimization
problem (3) (see [12,16] for more details). Notice that both the computation of
Q and minimizing (QPB) are convex optimization problems. Let $\tilde{X}$ denote the
global minimizer of (5). Then we compute a suboptimal solution to (1) by solving
the following linear assignment problem:

X_{QPB} = \arg\min_{X \in \Pi} \operatorname{tr}(X^\top \tilde{X}) .   (6)

The bounds presented so far can be ranked as follows:



(EVB) \le (PEVB) \le (QPB) \le (QAP) = \min_{X \in \Pi} \operatorname{tr}(A X B X^\top) .   (7)

We therefore expect to obtain better solutions to (1) using (6) than using (4). 
This will be confirmed in the following section. 



5 Experiments 



We conducted extensive numerical experiments in order to compare the ap- 
proaches sketched in Sections 3 and 4. The results are summarized in the fol- 
lowing. Two classes of experiments were carried out: 

— We used the QAPLIB-library [13] from combinatorial mathematics which is 
a collection of problems of the form (1) which are known to be “particularly 
difficult”. 

— Furthermore, we used large sets of computer-generated random graphs with
sizes up to |V| = 15 such that (i) the global optimum could be computed
as ground truth by using an exact search algorithm, and (ii) statistically
significant results could be obtained with respect to the quality of the
various approaches.

QAPLIB benchmark experiments. 

Table 1 shows the results computed for several QAPLIB-problems. The following 
abbreviations are used: 

QAP: name of the problem instance (1) taken from the library
X*: value of the objective function (1) at the global optimum
QPB: the quadratic programming bound (5)
Xqpb: value of the objective function (1) using Xqpb from (6)
Xqpb+: Xqpb followed by the 2opt greedy-strategy
Xga: value of the objective function (1) using X from (2)
Xga+: Xga followed by the 2opt greedy-strategy
Xume: value of the objective function (1) using (4)
Xume+: Xume followed by the 2opt greedy-strategy

The 2opt greedy-strategy amounts to iteratively exchanging two rows and columns
of the matrix X as long as an improvement of the objective function is possible
[10].
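
The 2opt post-processing step can be sketched as follows. This is an illustrative pure-Python version, assuming the permutation matrix X is stored as a list p with X[i][p[i]] = 1, so that exchanging two rows of X corresponds to swapping two entries of p; the function names and the example matrices are not taken from the paper.

```python
def objective(A, B, p):
    """tr(A X B X^T) for the permutation matrix X with X[i][p[i]] = 1."""
    n = len(p)
    return sum(A[i][j] * B[p[j]][p[i]] for i in range(n) for j in range(n))

def two_opt(A, B, p):
    """Greedy pairwise exchange: keep swapping two assignments while the objective decreases."""
    p = list(p)
    best = objective(A, B, p)
    improved = True
    while improved:
        improved = False
        for i in range(len(p) - 1):
            for j in range(i + 1, len(p)):
                p[i], p[j] = p[j], p[i]      # tentative exchange
                value = objective(A, B, p)
                if value < best:
                    best = value             # keep the improving exchange
                    improved = True
                else:
                    p[i], p[j] = p[j], p[i]  # undo
    return p, best

# Hypothetical small instance.
A = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
B = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
start = [0, 1, 2]
p_opt, value_opt = two_opt(A, B, start)
```

The loop terminates because the objective strictly decreases with every accepted exchange and there are only finitely many permutations; the result is a local optimum with respect to pairwise exchanges, not necessarily the global one.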

By inspection of Table 1, three conclusions can be drawn:

— The convex relaxation approach Xqpb and the soft-assign approach Xga 
have similarly good performance, despite the fact that the latter approach 
is much more intricate from the optimization point-of-view and involves a 
couple of tuning parameters which were optimized by hand. 

— The approach of Umeyama Xume based on orthogonal relaxation is not as 
competitive. 

— Using the simple 2opt greedy-strategy as post-processing step significantly 
improves the solution in most cases. 

In summary, these results indicate that the convex programming approach 
Xqpb embedded in a more sophisticated search strategy (compared to 2opt) is 
an attractive candidate for solving the weighted graph-matching problem. 

Random ground-truth experiments. 

We created many problem instances of (1) by randomly generating graphs. The
probability that an edge of the underlying complete graph is present was about
0.3. For each pair of graphs, the global optimum was computed using an exact
search algorithm.

Table 2 summarizes the statistics of our results. The notation explained in
the previous section was used. The first column on the left shows the problem
size n together with the number of random experiments in square brackets.
The number pairs in round brackets denote the number of experiments for which
the global optimum was found without/with the 2opt greedy-strategy as a
post-processing step. Furthermore, the worst case, the best case, and the
average case for the relative values for each of the three estimates presented
in Sections 3 and 4 are shown (note that these values are smaller than 1
because the value of the objective function (1) is negative for this class of
experiments). In summary, the conclusions with respect to the
QAPLIB-experiments are confirmed.

QAP             X*       QPB      Xqpb     Xqpb+       Xga      Xga+      Xume     Xume+
chr12c       11156    -22648     20306     15860     19014     11186     40370     11798
chr15a        9896    -48539     26132     14454     30370     11062     60986     17390
chr15c        9504    -47409     29862     17342     23686     13342     76318     13338
chr20b        2298     -7728      6674      2858      6290      2650     10022      3294
chr22b        6194    -20995      9942      6848      9658      6732     13118      7418
esc16b         292       250       296       292       298       292       306       292
rou12       235528    205461    278834    246712    273438    246282    295752    251848
rou15       354210    303487    381016    371480    457908    359748    480352    384018
rou20       725522    607362    804676    746636    840120    738618    905246    765872
tai10a      135028    116260    165364    143260    168096    135828    189852    147838
tai15a      388214    330205    455778    399732    451164    400328    483596    405442
tai17a      491812    415578    550852    513170    589814    505856    620964    526814
tai20a      703482    584942    799790    740696    871480    724188    915144    775456
tai30a     1818146   1517829   1996442   1883810   2077958   1886790   2213846   1875680
tai35a     2422002   1958998   2720986   2527684   2803456   2496524   2925390   2544536
tai40a     3139370   2506806   3529402   3243018   3668044   3249924   3727478   3282284

Table 1. Results of the QAPLIB benchmark experiments (see text).







                        Xqpb/X*                            Xume/X*                            Xga/X*
             mean      worst case  best case    mean      worst case  best case    mean       worst case  best case
n=9 [128]    (22/53)                            (7/29)                             (31/55)
             0.87607   0.43552     1            0.638244  0.0651729   1            0.948342   0.7756129   1
  2opt       0.966155  0.79256     1            0.928304  0.753007    1            0.9699138  0.843046    1
n=11 [42]    (3/11)                             (0/7)                              (7/10)
             0.824023  0.514964    1            0.636159  0.295194    0.998591     0.940740   0.8338586   1
  2opt       0.962258  0.842204    1            0.933206  0.811326    1            0.9588626  0.8434407   1
n=15 [99]    (0/5)                              (0/1?)                             (4/7?)
             0.741563  0.232741    0.938917     0.131333  0.225983    0.863508     0.916225   0.105164    1
  2opt       0.925801  0.777494    1            0.890131  0.74688     1            0.9576297  0.8205957   1

Table 2. Statistics of the results of random ground-truth experiments (see text).

6 Conclusion 

We have shown that, based on advanced techniques from convex optimization 
theory, suboptimal solutions to the weighted graph-matching problem can be 
computed which are competitive with respect to recent deterministic annealing 
approaches. In contrast to annealing approaches, however, the convex approach 
exhibits two favorable properties: Firstly, no tuning parameters are needed. Sec- 
ondly, it computes a lower bound and thus can be used as a subroutine within 
an exact search strategy like branch-and-bound, for example. As a result, it is an
attractive candidate for solving matching problems in the context of view-based 
object recognition. 

Acknowledgment: We are thankful for discussions with Prof. Dr.-Ing. W. Förstner,
D. Cremers and J. Keuchel. 

Abstract. This paper addresses the problem of learning shape models
from examples. The contributions are twofold. First, a comparative study 
is performed of various methods for establishing shape correspondence - 
based on shape decomposition, feature selection and alignment. Various 
registration methods using polygonal and Fourier features are extended 
to deal with shapes at multiple scales and the importance of doing so is 
illustrated. Second, we consider an appearance-based modeling technique 
which represents a shape distribution in terms of clusters containing 
similar shapes; each cluster is associated with a separate feature space. 
This representation is obtained by applying a novel simultaneous shape 
registration and clustering procedure on a set of training shapes. We 
illustrate the various techniques on pedestrian and plane shapes. 



1 Introduction 

For many interesting object detection tasks there are no explicit prior models 
available to support a matching process. This is typically the case for the detec- 
tion of complex non-rigid objects under unrestricted viewpoints and/or under 
changing illumination conditions. In this paper we deal with methods for
acquiring (i.e. "learning") shape models from examples. Section 2 reviews existing
methods for establishing shape correspondence and modeling shape variation.
Shape registration methods bypass the need for tedious manual labeling and
establish point correspondences between shapes in a training set automatically.
Although a sizeable literature exists in this area, little effort has been made
so far to compare the various approaches. We perform a comparative study of
various shape registration methods in Section 3 and demonstrate the benefit of
describing shapes at multiple scales for this purpose. The best performing
registration method is combined with a clustering algorithm to describe
arbitrary shape distributions in terms of clusters containing similar shapes;
each cluster is associated with a separate feature space, see Section 4.
Benefits of such a representation are discussed in Section 5, after which we
conclude in Section 6.

2 Review

2.1 Shape Registration 

A review of previous work on shape registration shows that a typical procedure 
follows a similar succession of steps: shape decomposition, feature selection, point 
correspondence and finally, alignment. 

The first step, shape decomposition, involves determining control
("landmark") points along a contour and breaking the shape into corresponding
segments. One way of doing this is to consider the curvature function along the
object's contour and to determine the locations of the minima and maxima. The
curvature function is computed by convolving the edge direction function of a
contour point with the first derivative of a Gaussian function. The parameter σ
in the Gaussian function determines the smoothing of a shape. A different
method for determining control points removes points according to a
criticality measure based on the area spanned by three successive points.

Once shape segments have been determined, the next step involves selecting
features that are transformation invariant (e.g. to translation, rotation and
scale). These features are usually selected based on an approximation of the
shape segments; most common are straight-line (i.e. polygonal) and Fourier
approximations [7]. Similarity measures are typically based on length ratios
and angles between successive segments for the polygonal case, and on weighted
Euclidean metrics on the low-order Fourier coefficients for the Fourier
approximations (e.g. [7]).

At this point, correspondence between the control points of two shapes can
be established by means of either combinatoric approaches or sequential pattern
matching techniques. Each correspondence is evaluated using match measures
on the local features discussed earlier. Combinatoric approaches iteratively
select a (e.g. minimal) set of initial correspondences and use various greedy
methods to assign the remaining points. The advantage of this approach is that
the overall goodness of a match can be based on all contour points
simultaneously. Additional effort is necessary, though, to account for ordering
constraints. Sequential pattern matching techniques, on the other hand,
inherently enforce ordering constraints. Dynamic programming is widely used
since it is an efficient technique for finding optimal solutions when the
evaluation measure has the Markov property.

Finally, after correspondence between control points has been established
(and possibly, between interpolated points), alignment with respect to similarity
transformation is achieved by a least-squares fit [2]. The above techniques for
shape registration have important applications for partial curve matching in
object recognition tasks. The next subsection deals with their use for
modeling shape variation.

2.2 Modeling Shape Variation

Registration establishes point correspondence between a pair of shapes. The 
straightforward way to account for N shapes is to embed all N shapes in a 
single feature vector space, based on the x and y locations of their
corresponding points. This is done either by selecting one shape (typically, the
"closest" to the others) and aligning all others to it, or by employing a
somewhat more complex hierarchical procedure. The resulting vector space allows
the computation of various compact representations for the shape distribution,
based on radial (mean-variance) or modal (linear subspace) decompositions.
Combinations are also possible.

The assumption that point correspondence can be established across all train- 
ing shapes of a particular object class by means of automatic shape registration 
is in many cases problematic, however. For example, none of the shape regis- 
tration methods we analyzed were able to correctly register a pedestrian shape 
with the two feet apart with one with the feet together, without operator input 
or prior knowledge. A more general registration approach does not forcibly em- 
bed all N shapes into the same feature vector space. Instead, it combines shape 
clustering and registration, embedding only the (similar) shapes within a cluster
into the same vector space. This is the approach pursued in Section 4, adapted
from earlier work [3].

Finally, some approaches do not try to embed shapes into a vector space
at all; a hierarchical representation is built solely on the basis of pairwise
dissimilarity values. See also Section 5.

3 Shape Registration 

In this study, we jointly compare the performance of control point detection,
feature extraction and matching algorithms for the purpose of shape registration.
Under consideration are two techniques for control point detection: the first is
the Gaussian-based filtering of the curvature function and the second is the
critical point detection algorithm, both reviewed in Section 2.1. We furthermore
compare two techniques for feature selection and similarity measurement: one
based on polygonal approximations and one based on piecewise Fourier
decompositions [7]. As matching algorithm we invariably choose dynamic
programming, because of its efficiency and optimality property. This leads to a
total of four methods under
consideration. The experiments involve computing pairwise alignments between 
all elements of a particular shape data set and recording the mean Euclidean 
distance between corresponding contour points after alignment, i.e. the mean 
alignment error. Dense point correspondences are inferred by interpolating the 
correspondences between the control points. 
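
The mean alignment error used in the experiments can be illustrated with a short sketch. It assumes that dense point correspondences are already given and that alignment is a least-squares similarity fit (scale, rotation, translation); contour points are represented as complex numbers x + iy, which is a convenience chosen here and not something the paper prescribes. The function names are hypothetical.

```python
def align_similarity(src, dst):
    """Least-squares similarity transform w = a*z + b mapping src onto dst.

    src and dst are equally long lists of complex numbers (x + 1j*y) whose
    i-th elements correspond to each other.
    """
    n = len(src)
    mu_s = sum(src) / n
    mu_d = sum(dst) / n
    num = sum((z - mu_s).conjugate() * (w - mu_d) for z, w in zip(src, dst))
    den = sum(abs(z - mu_s) ** 2 for z in src)
    a = num / den        # complex factor: |a| is the scale, arg(a) the rotation
    b = mu_d - a * mu_s  # translation
    return a, b

def mean_alignment_error(src, dst):
    """Mean Euclidean distance between corresponding points after alignment."""
    a, b = align_similarity(src, dst)
    return sum(abs(a * z + b - w) for z, w in zip(src, dst)) / len(src)

# Hypothetical example: a unit square, rotated by 90 degrees, scaled by 2, shifted.
square = [0j, 1 + 0j, 1 + 1j, 1j]
moved = [2j * z + (3 + 1j) for z in square]
a, b = align_similarity(square, moved)
error = mean_alignment_error(square, moved)
```

For shapes that differ only by a similarity transform the error vanishes; for real contours it measures the residual shape difference in pixels.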

Our study furthermore extends previous shape registration work by
representing shapes at multiple scales. For the Gaussian curvature filtering
method, this is achieved by using multiple Gaussian σ values, whereas for the
critical point detection method this involves using different area criteria.
In the experiments, we compute shape registrations for all scales and consider
the one which minimizes the mean alignment error.

Two data sets were used to evaluate the four method combinations: a set
of 25 plane contours (height 100 - 200 pixels) and a set of 50 pedestrian
contours (height 80 pixels). See Figure 1. Figure 2 illustrates the multi-scale
shape representation resulting from the Gaussian filtering of the curvature
function. Figure 2c shows the effects of varying the number of Fourier
coefficients, at one particular scale.

The importance of maintaining shape representations at multiple scales is
demonstrated in Figure 3. The histogram shows the number of alignments (y-axis)
at which a particular scale σ-interval (x-axis) results in the best pairwise
alignment in terms of minimizing the mean alignment error. This particular
data pertains to the registration method that combines control point detection
by Gaussian curvature filtering and the use of Fourier features for the pedestrian
data set. From the figure one observes that a wide range of scales is utilized for
the "best" alignment; a registration approach that would only use a single scale
representation (i.e. choosing a single bar of the histogram) would be non-optimal
for most of the shape alignments considered.

Some typical point correspondences and alignments are shown in Figure 4.
As can be seen, the registration method is quite successful in pairing up the 
physically corresponding points (e.g. the tip and wings of the planes, the heads 
and feet of the pedestrians). 


Fig. 1. Example (a) pedestrian shapes (b) plane shapes. 



Figure 5 summarizes the results of the shape registration experiments. It
shows the cumulative distribution of the mean alignment error for the plane
(Figure 5a) and pedestrian (Figure 5b) data sets, for the four combinations
analyzed and based on 600 and 2450 pairwise alignments, respectively. For
example, about 80% of the pedestrian alignments resulted in a mean alignment
error smaller than 9 pixels (on a pedestrian size of 80 pixels) for method c-,
Gaussian curvature filtering with polygonal features. More revealing is the
relative performance of the methods. From the figure one observes that Gaussian
filtering of the curvature function provides a better multi-scale shape
representation than that derived from the critical point detection algorithm
(compare graphs a- and c- versus graphs b- and d-), at least as far as our
implementation is concerned. Also, the Fourier features proved to be more
suitable in dealing with the curved shapes of our data sets (compare graphs a-
and b- versus graphs c- and d-).

Fig. 2. Multi-scale representation (a) original shape (b) polygonal approximations
(decreasing σ value from left to right) (c) Fourier approximations (increasing number
of coefficients used from left to right)

Fig. 3. Histogram showing number of alignments (y-axis) at which a particular scale
σ-interval (x-axis) results in the best pairwise alignment, using Gaussian curvature
filtering



Fig. 4. Established point correspondences between two (a) planes and (b) pedestrians
and their pairwise alignment, (c) and (d). Aligned shapes are superimposed in grey.

Fig. 5. Cumulative distribution of the mean alignment error for the (a) plane and
(b) pedestrian data of the four methods analyzed (a- Gaussian curvature filtering,
multi-scale, with Fourier features; b- "critical point", multi-scale, with Fourier
features; c- Gaussian curvature filtering, multi-scale, with polygonal features;
d- "critical point", multi-scale, with polygonal features). The horizontal
axis is in units of pixels. Object size 100 - 200 pixels for (a) and 80 pixels for (b)

4 Shape Clustering

After the comparative study on shape registration, we used the best performing
method (i.e. Gaussian curvature filtering for control point detection and Fourier
coefficients as features) to embed the N shapes of the training set into feature
spaces. As mentioned in Subsection 2.2, for many data sets it is not feasible
to map all N shapes onto a single feature space because of considerable
shape differences. Therefore, we follow a combined clustering and registration
approach which establishes shape correspondence only between (similar) shapes
within a cluster. The clustering algorithm is similar to the classical K-means
approach:

0. pick an initial shape S_1 and add it to cluster C_1
   as prototype: C_1 = {S_1}, P_1 = S_1
while there are shapes left do
   1. select one of the remaining shapes: S_k
   2. compute the mean alignment error d(S_k, P_i) from
      element S_k to the existing prototypes P_i,
      where i ranges over the number of clusters
      created so far
   3. compute d(S_k, P_j) = min_i d(S_k, P_i)
      if d(S_k, P_j) > θ
      then assign S_k to a new cluster C_{n+1}:
           C_{n+1} = {S_k}, P_{n+1} = S_k
      else assign S_k to the existing cluster
           C_j = {S_j1, ..., S_jn} and update its prototype:
           C_j = C_j ∪ {S_k}
           P_j = Mean(S_j1, ..., S_jn, S_k)
end

In the above, Step 2 consists of applying our best performing shape registration 
method (see previous Section) to establish point correspondences and compute 
alignment errors. The resulting point correspondences are used for the prototype 
computation in Step 3. See Figure 6a. Parameter θ is a user-defined dissimilarity
threshold that controls the number of clusters to be created. Compared to [3], the
above formulation has the advantage that it does not require the computation
of the full dissimilarity matrix d(S_i, S_j). It furthermore adjusts the prototype
whenever new elements are assigned to a group. Figure 6b illustrates some typical
clustering results.
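
The clustering loop above can be sketched in a few lines of Python. To keep the example self-contained, registration is replaced by a strong simplifying assumption: all shapes are lists of 2d points of equal length whose i-th points already correspond, so the distance is simply the mean point-to-point distance and the prototype is the pointwise mean. The names `pointwise_error`, `mean_shape` and `cluster_shapes` are hypothetical.

```python
import math

def pointwise_error(shape, prototype):
    """Stand-in for the mean alignment error: mean distance of corresponding points."""
    return sum(math.dist(a, b) for a, b in zip(shape, prototype)) / len(shape)

def mean_shape(shapes):
    """Prototype of a cluster: arithmetic mean of corresponding points."""
    n = len(shapes)
    return [tuple(sum(s[k][c] for s in shapes) / n for c in range(2))
            for k in range(len(shapes[0]))]

def cluster_shapes(shapes, theta, distance=pointwise_error):
    """Sequential clustering with dissimilarity threshold theta."""
    clusters = [[shapes[0]]]          # step 0: first shape founds cluster C_1
    prototypes = [shapes[0]]
    for shape in shapes[1:]:          # step 1: take the next remaining shape
        errors = [distance(shape, p) for p in prototypes]     # step 2
        j = min(range(len(errors)), key=errors.__getitem__)   # step 3: closest prototype
        if errors[j] > theta:
            clusters.append([shape])  # too dissimilar: open a new cluster
            prototypes.append(shape)
        else:
            clusters[j].append(shape)                 # assign to cluster C_j
            prototypes[j] = mean_shape(clusters[j])   # and update its prototype
    return clusters, prototypes

# Hypothetical toy data: two nearly identical squares and one far away.
shapes = [
    [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    [(0.1, 0.0), (1.1, 0.0), (1.1, 1.0), (0.1, 1.0)],
    [(5.0, 5.0), (6.0, 5.0), (6.0, 6.0), (5.0, 6.0)],
]
clusters, prototypes = cluster_shapes(shapes, theta=1.0)
```

Only distances to the current prototypes are evaluated, which reflects the remark that the full dissimilarity matrix is not needed. The result depends on the order in which shapes are presented.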







Fig. 6. (a) Computing a “mean” or prototype shape (in black) from aligned shape 
samples (in grey) (b) Shape clustering: each row contains the elements of a different 
cluster 



5 Outlook 

Learned shape models can be used for matching and tracking purposes. We are
currently utilizing the approach described in the previous section for learning
pedestrian shapes and improving the performance of the Chamfer System, a
system for shape-based object detection. At its core, the system performs template
matching using distance transforms. It has an off-line phase where a template
hierarchy is built by clustering shapes recursively; clustering is solely based on
dissimilarity values. Here, the shape registration and modeling approach pursued
establishes feature spaces which facilitate the computation of improved
shape prototypes, i.e. the arithmetic mean of the shape cluster elements.
Furthermore, the presence of a feature space enables standard data dimensionality
reduction techniques which can be used for the generation of "virtual" shapes,
enriching the training set of the Chamfer System.

D.M. Gavrila, J. Giebel, and H. Neumann

6 Conclusions 

This paper performed a comparative study of various methods for shape
registration and identified a multi-scale method based on Gaussian curvature
filtering and Fourier features as the best performing one. This registration step
was incorporated in a shape clustering algorithm which allows the representation of arbitrary
shape distributions in terms of clusters of similar shapes, embedded in separate 
feature spaces. 

Abstract. This paper presents a novel strategy for rapid reconstruction of 3d
surfaces based on 2d gradient directions. I.e., this method does not use triangu- 
lation for range data acquisition, but rather computes surface normals. These 
normals are 2d integrated and thus yield the desired surface coordinates; in ad- 
dition they can be used to compute robust 3d features of free form surfaces. The 
reconstruction can be realized with uncalibrated systems by means of very fast 
and simple table look-up operations with moderate accuracy, or with calibrated 
systems with superior precision. 



1. Introduction 

Many range finding techniques have been proposed in the open literature, such as
stereo vision [1], structured light [2, 3, 4, 5], coded light [6], shape from texture [7, 8],
shape from shading [9], etc. Similar to some structured light methods, our suggested
novel approach for 3d surface reconstruction requires one camera and two light stripe
projections onto an object to be reconstructed (Fig. 1). However, our reconstruction
technique is based on the fact that angles of stripes in the captured 2d image depend
on the orientation of the local object surface in 3d. Surface normals and range data
are not obtained via triangulation, as in the case of most structured light or coded
light approaches, but rather by computing the stripe angles of the 2d stripe image by
means of gradient directions. Each stripe angle determines one degree of freedom of
the local surface normal. Therefore, the total reconstruction of all visible surface
normals requires at least two projections with stripe patterns rotated relative to each
other. A method that appears similar at first glance, using a square pattern, has been
presented in [10]; in contrast to our approach it requires complex computation, e.g. for
detection of lines and line crossings and for checking of grid connectivity. Moreover,
it utilizes a lower density of measurement points yielding a lower lateral resolution.







Fig. 2. Processing steps of the 'Shape from 2D Edge Gradients' approach

2. Measurement Principle

The surface reconstruction can be subdivided into several functional steps (see Fig. 2). 
First, we take two grey level images of the scene illuminated with two differently 
rotated stripe patterns. Undesired information, like inhomogeneous object shadings 
and textures are eliminated by an optional appropriate preprocessing (Section 3). Sub- 
sequently, local angles of stripe edges are measured by a gradient operator. This leads 
to two angle images, which still contain erroneous angles and outliers which have to 
be extracted and replaced by interpolated values in an additional step (Section 4). On 
the basis of two stripe angles at one image pixel we calculate the local 3d surface 
slope or surface normal (Section 5). The surface normals can be used to reconstruct 
the surface itself, or they can be utilized as basis for 3d feature computation
(Section 6).



3. Preprocessing 




Fig. 3. Preprocessing steps of a textured test object, (a) object illuminated with stripe pattern; 
(b) object with ambient light; (c) object with homogeneous projector illumination; (d) absolute 
difference between (a) and (b); (e) absolute difference between (b) and (c); (f) normalized stripe 
image; (g) binary mask 











S. Winkelbach and F.M. Wahl 



In order to be able to detect and analyse the illumination stripe patterns in grey level
images, we apply an adapted optional preprocessing procedure, which separates the
stripe pattern from arbitrary surface reflection characteristics of the object (color,
texture, etc.). Fig. 3 illustrates this procedure: after capturing the scene with stripe
illumination (a), we acquire a second image with ambient light (b) and a third one with
homogeneous illumination by the projector (c). Using the absolute difference
(d = |a - b|) we scale the dark stripes to zero value and separate them from object color.
However, shadings and varying reflections caused by the projected light are still
retained in the bright stripes. For this reason we normalize the stripe signal (d) to a
constant magnitude by dividing it by the absolute difference (e) between the illuminated
and non-illuminated image (f = d / e). As can be seen from Fig. 3f, noise is intensified
as well by this division; it gets the same contrast as the stripe patterns. This noise
arises in areas where the projector illuminates the object surface with low intensity.
Therefore we eliminate it by using the mask (g = 1 if e > threshold, g = 0 otherwise).
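
The preprocessing chain can be written down directly. The following pure-Python sketch works on images given as nested lists of grey values and assumes a non-negative threshold, so that the division is only carried out where e > 0; setting masked-out pixels of f to zero is a choice made here for illustration. The function name is hypothetical.

```python
def normalize_stripes(a, b, c, threshold):
    """Normalized stripe image f and binary mask g.

    a: image with stripe illumination, b: image with ambient light only,
    c: image with homogeneous projector illumination (same size, nested lists).
    threshold is assumed to be non-negative.
    """
    rows, cols = len(a), len(a[0])
    f = [[0.0] * cols for _ in range(rows)]
    g = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            d = abs(a[y][x] - b[y][x])  # stripes without object color
            e = abs(c[y][x] - b[y][x])  # illuminated minus non-illuminated image
            if e > threshold:           # projector light is strong enough here
                g[y][x] = 1
                f[y][x] = d / e         # stripe signal with constant magnitude
    return f, g

# Hypothetical 2x2 example.
a = [[110, 20], [12, 60]]
b = [[10, 20], [10, 10]]
c = [[110, 120], [12, 110]]
f, g = normalize_stripes(a, b, c, threshold=5)
```

In the example the lower left pixel receives almost no projector light (e = 2) and is therefore masked out instead of being amplified by the division.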



4. Stripe Angle Determination 



After preprocessing the stripe images, determination of stripe angles can take place by 
well-known gradient operators. We evaluated different gradient operators like Sobel, 
Canny, etc. to investigate their suitability. Fig. 4 shows the preprocessed stripe image 
of a spherical surface in (a), gradient directions in (b) (angles are represented by dif- 
ferent grey levels); in this case the Sobel operator has been applied. Noisy angles arise 
mainly in homogeneous areas where gradient magnitudes are low. Thus low gradient
magnitudes can be used to eliminate erroneous angles and to replace them by interpo- 
lated data from the neighborhood. For angle interpolation we propose an efficient 
data-dependent averaging interpolation scheme: A homogeneous averaging-filter is 
applied to the masked grey level image and the result is divided by the likewise aver- 
age-filtered binary mask (0 for invalid and 1 for valid angles). In this way one obtains 
the average value of valid angles within the operator window. Finally, erroneous an- 
gles are replaced by the interpolated ones. 
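
The interpolation scheme described above amounts to dividing a box-filtered masked angle image by the box-filtered mask. A minimal pure-Python sketch follows; it assumes that angles inside one window do not wrap around (e.g. from 179 to 0 degrees), since plain averaging would be wrong in that case. Function names and the window radius are illustrative.

```python
def box_sum(img, radius):
    """Sum over a (2*radius+1)^2 window, clipped at the image border."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            total = 0.0
            for yy in range(max(0, y - radius), min(rows, y + radius + 1)):
                for xx in range(max(0, x - radius), min(cols, x + radius + 1)):
                    total += img[yy][xx]
            out[y][x] = total
    return out

def interpolate_angles(angles, valid, radius=1):
    """Replace invalid angles by the average of the valid angles in the window."""
    masked = [[a * v for a, v in zip(row_a, row_v)]
              for row_a, row_v in zip(angles, valid)]
    num = box_sum(masked, radius)  # filtered masked angle image
    den = box_sum(valid, radius)   # filtered binary mask = number of valid neighbours
    out = [row[:] for row in angles]
    for y in range(len(angles)):
        for x in range(len(angles[0])):
            if not valid[y][x] and den[y][x] > 0:
                out[y][x] = num[y][x] / den[y][x]
    return out

# Hypothetical example: one outlier in the centre of a constant angle image.
angles = [[10, 10, 10], [10, 99, 10], [10, 10, 10]]
valid = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
result = interpolate_angles(angles, valid)
```

Because the same window is used for numerator and denominator, the normalisation factor of the averaging filter cancels, so unnormalised sums suffice.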




Fig. 4. Computation of angle images, (a) Stripe image; (b) Sobel gradient angles (angle image); 
(c) masked angle image by means of gradient magnitude; (d) interpolated angle image 

5. Surface Normal Computation








Fig. 5. Projection of a stripe on an object 

We investigated two methods to compute surface slopes (or normals) from stripe
angles: a mathematically correct computation with calibrated optical systems (camera
and light projectors) on the one hand, and on the other hand a simple table look-up
method which maps two stripe angles to one surface normal. Due to the limited space
of this paper, we will only give a general idea of the mathematical solution: The angle
values ω_1, ω_2 of both rotated stripe projections and their 2d image coordinates specify
two "camera planes" and corresponding normals c_1, c_2 (see Fig. 5). The tangential
direction vector v_i of a projected stripe on the object surface is orthogonal to
c_i and orthogonal to the normal p_i of the stripe projection plane. So we can use the
following simple equation to calculate the surface normal:

n = (c_1 \times p_1) \times (c_2 \times p_2)
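
The double cross product can be illustrated with a short sketch. Each stripe tangent is orthogonal to both the camera plane normal and the projection plane normal, so it is their cross product; the surface normal is orthogonal to both tangents. The vectors in the example are hypothetical and chosen so that the result is easy to verify by hand; the sign of the normal depends on the ordering of the vectors.

```python
import math

def cross(u, v):
    """Cross product of two 3d vectors given as tuples."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def surface_normal(c1, p1, c2, p2):
    """Unit surface normal from camera plane normals c_i and projection plane normals p_i."""
    v1 = cross(c1, p1)  # tangential direction of the first stripe
    v2 = cross(c2, p2)  # tangential direction of the second stripe
    n = cross(v1, v2)
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:
        raise ValueError("stripe tangents are parallel; the normal is undefined")
    return tuple(x / length for x in n)

# Hypothetical configuration: stripe tangents along x and y, so the normal is along z.
normal = surface_normal((0, 1, 0), (0, 0, 1), (0, 0, 1), (1, 0, 0))
```

The degenerate case of parallel tangents corresponds to two stripe patterns that are not rotated relative to each other, which is why two differently rotated projections are required.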

Generation of the look-up table in the second approach works similarly to the
look-up table based implementation of photometric stereo [9]:




Fig. 6. Computation of the look-up table with use of stripe projections onto a spherical surface 
By means of the previously described processing steps, we first estimate the stripe
angles of two stripe projections, rotated relative to each other, onto a spherical surface
(see Fig. 6). As the surface normals of a sphere are known, we can use the two angle
values ω_1, ω_2 at each surface point of the sphere to fill the 2d look-up table with ω_1 and
ω_2 as address. Subsequently, missing values in the table are computed by interpolation.
Now the look-up table can be used to map two stripe angles to one surface normal. To
compute range data from surface slopes we applied the 2d integration method
proposed by Frankot/Chellappa [11].
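
A minimal sketch of such a look-up table is given below. It assumes that stripe angles are measured in degrees modulo 180 and quantized into equally wide bins, and it stores the table as a dictionary; both are choices made for illustration. Cells that receive several calibration samples store the normalized mean normal. The interpolation of missing cells and the subsequent integration step are omitted.

```python
import math

def angle_bin(angle, bins):
    """Index of the quantization bin for an angle in degrees (taken modulo 180)."""
    width = 180.0 / bins
    return min(int((angle % 180.0) // width), bins - 1)

def build_lookup_table(samples, bins=36):
    """Fill the table from calibration samples (omega1, omega2, known sphere normal)."""
    sums = {}
    for w1, w2, normal in samples:
        acc = sums.setdefault((angle_bin(w1, bins), angle_bin(w2, bins)), [0.0, 0.0, 0.0])
        for k in range(3):
            acc[k] += normal[k]
    table = {}
    for key, acc in sums.items():
        length = math.sqrt(sum(x * x for x in acc))
        if length > 0.0:
            table[key] = tuple(x / length for x in acc)
    return table

def lookup_normal(table, w1, w2, bins=36):
    """Map two stripe angles to a surface normal; None if the cell is empty."""
    return table.get((angle_bin(w1, bins), angle_bin(w2, bins)))

# Hypothetical calibration samples (angles in degrees, unit normals).
samples = [
    (11.0, 101.0, (0.0, 0.0, 1.0)),
    (13.0, 103.0, (0.0, 0.0, 1.0)),
    (91.0, 21.0, (1.0, 0.0, 0.0)),
]
table = build_lookup_table(samples)
hit = lookup_normal(table, 12.0, 102.0)
miss = lookup_normal(table, 52.0, 52.0)
```

Once the table is filled, mapping a pixel's two stripe angles to a normal is a single dictionary access, which is what makes the uncalibrated variant fast.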



6. Experimental Results and Conclusions 

The following images illustrate experimental results of the look-up table approach 
proposed above. 




Fig. 7. Depth data and plane features of a test cube: (a) surface x-gradient, (b) surface
y-gradient, (c) range map with corresponding line profiles; (d) histograms of (ω_1, ω_2)-tuples

Fig. 7 shows the gradients of a cube and plots of sample lines. The ideal surface
gradients are constant within every face of the cube. Thus, accuracy of measurement can be
evaluated by the standard deviation within a face, which in our first experiments
shown here is approximately 0.4109 degrees. Errors are reduced after integration (see
Fig. 7(c)). Fig. 7(d) shows the 2d histogram of the gradient direction tuples (ω_1, ω_2),
corresponding to the three faces of the cube.

Fig. 8 shows the 2d grey level image of a styrofoam head, its reconstructed range map
and its corresponding rendered grey level images from two different viewing directions.
Using more than two stripe projections from different illumination directions
can improve the reconstruction result.




Fig. 8. Reconstruction of a styrofoam head: (a) grey level image; (b) reconstructed range map;
(c and d) corresponding rendered grey level images from two different viewing directions
obtained by virtual illumination in 3d

In many applications surface normals are an important basis for robust 3d features,
for example surface orientations, relative angles, curvatures, local maxima and saddle
points of surface shape. An advantage of our approach is that it is very efficient and
directly generates surface normals without the need to derive them subsequently
from noisy range data. In the case of textured and inhomogeneously reflecting objects, our
technique is more robust than most other methods.

Due to the limited space of this paper, we have only been able to present a rough
outline of our new approach. Important aspects such as a detailed error analysis and
applications of our technique are the subject of further publications [12].
Regarding accuracy, we should mention that in the experiments shown above we
used optical systems with long focal lengths. Although our experiments are very
promising, the new technique is open to improvements. For example, the reconstruction time
can be reduced by color-coding the two stripe patterns and projecting them simultaneously.
Acquiring several stripe images with phase-shifted stripe patterns could
increase the number of gradients with high magnitudes, thus reducing the need for
replacing erroneous gradient directions by interpolation. An alternative augmented
technique uses only one stripe projection for full 3d surface reconstruction [13]. This
is possible by using the stripe angles and the stripe widths as sources of information
to estimate surface orientations.

Abstract. In this paper a new approach to video-based obstacle detection for a
mobile robot is proposed, based on probabilistic evaluation of image data. Apart
from the measurement data, their uncertainties are also taken into account.
Evaluation is achieved using the Kalman filter technique, combining the results of
video data processing and robot motion data. Obstacle detection is realised by
computing the obstacle probability and subsequently applying a threshold
operator. The first experiments show remarkably stable obstacle detection.

Keywords: Kalman filter, obstacle detection, stereo based vision. 



1. Introduction 



Mobile service robots can be used for different tasks in industrial environments where
highly reliable and robust navigation is required. One of the main tasks of such a
mobile robot system is to detect moving and non-moving obstacles even if the
(lighting) conditions around the robot change.

The Kalman filter is a popular method for solving localisation tasks for mobile robots. An
overview is given e.g. in [3]. In [1] the Kalman filter fuses the results of image
processing algorithms with laser scanner data, depending on their
accuracies, which allows the robot position to be determined more accurately. A relatively
new method is proposed in [3]: the authors used a robust probabilistic approach
called Markov localisation to determine the robot position utilising sonar sensor data
or laser scanner data.

In previous work [6] a stereo-based vision system was presented, which enables
obstacle detection using a robust method for disparity estimation. Even though the system
yields good results, there can be situations where phantom objects appear,
that is, objects which do not exist in reality. This can have different reasons, e.g.
reflections on the ground where the robot moves, camera blinding by a light beam, or
noise. Because these objects do not exist in reality, they normally disappear in the
next image pair after the robot moves. It seems possible to eliminate these effects by
using information about the robot movement and by evaluating subsequent images.

In this paper we present a method that uses Kalman filter techniques to robustly
combine results of 3D-reconstruction from different image pairs, taken at different
robot positions. The uncertainties of the robot motion are taken into account as
well as the uncertainties of the 3D-points that are measured from different image
pairs. The description of this method, including error propagation, can be found in
chapter 2.

In chapter 3, a probabilistic interpretation of each 3D-point, including its
uncertainty in the form of the covariance matrix, is made to determine the probability that
there is an obstacle in a certain area in front of the robot; a similar idea is used
in [3]. The difference is that in [3] the probabilistic interpretation of the
sensor data was performed to determine the position of a mobile robot.

The experimental results are given in chapter 4. Finally, the paper concludes
with a short summary in chapter 5.



2. Kalman Filter 



The well-known equations of the Kalman filter are not given here; the interested
reader is referred to the introductory paper [8] or to the book [4].

Fig. 1. Block diagram of Kalman filtering and probabilistic obstacle detection (inputs: robot motion, image data)

First, the prediction step will be considered (see Fig. 1). Given an estimate
$\hat{x}_k = (x_k, y_k, z_k)^T$ of a 3D-point computed after the image pair $k$ was evaluated,
the prediction $\hat{x}^-_{k+1}$ of $\hat{x}_k$ to the robot position where the image pair $k+1$
is grabbed is done as follows (process equation):

$$\hat{x}^-_{k+1} = R(\beta_k)\,(\hat{x}_k - t_k), \qquad (1)$$

with the rotation matrix $R(\beta_k)$ that describes the change of robot orientation
by an angle $\beta_k$ (rotation around the robot y-axis) and the translation vector
$t_k = (t_{xk}, t_{yk}, t_{zk})^T$ (see Fig. 2). It is assumed that the robot moves on the xz-plane, so
the motion does not change the y-coordinate ($t_{yk} = 0$).
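The process equation (1) can be sketched in a few lines. The sign convention of the rotation about the y-axis and the function names are assumptions; the paper does not state them.

```python
import math


def rotation_y(beta):
    """Rotation matrix about the y-axis (right-handed convention, an assumption)."""
    c, s = math.cos(beta), math.sin(beta)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def predict_point(x, t, beta):
    """Prediction step of equation (1): R(beta) * (x - t)."""
    d = [x[i] - t[i] for i in range(3)]
    R = rotation_y(beta)
    return [sum(R[row][col] * d[col] for col in range(3)) for row in range(3)]
```

For a robot that only translates (beta = 0), the predicted point is simply the old point shifted by the negative translation; the y-coordinate is never changed by the rotation.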




Both the rotation angle $\beta_k$ and the translation vector $t_k$ are assumed to be
affected by normally distributed non-correlated errors. The error variance of
$\beta_k$ is denoted by $\sigma^2_{\beta_k}$, the error of $t_k$ is
described by its 3x3 covariance matrix $C_{t_k}$



Fig. 2. Representation of robot motion 



Robust Obstacle Detection from Stereoscopic Image Sequences 387 



and the uncertainty of the point $\hat{x}_k$ by the 3x3 covariance matrix $P_k$. Because
$t_{yk} = 0$, the variances and covariances in $C_{t_k}$ that are related to $t_{yk}$ are 0.

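To determine the covariance matrix $P^-_{k+1}$ of $\hat{x}^-_{k+1}$, the prediction (1) is considered
as a function of 7 parameters: $\hat{x}^-_{k+1} = \hat{x}^-_{k+1}(\beta_k, t_{xk}, t_{yk}, t_{zk}, x_k, y_k, z_k)$. Then, $P^-_{k+1}$ is
computed by means of Gaussian error propagation technique: $P^-_{k+1} = T C T^T$ [2] with
the covariance matrix $C$ of the parameter vector and the Jacobian matrix $T$:

$$T = \frac{\partial \hat{x}^-_{k+1}}{\partial(\beta_k, t_{xk}, t_{yk}, t_{zk}, x_k, y_k, z_k)}.$$

Further simplification can be made: because of the assumed independence of errors and
the assumption that $R$ depends only on $\beta_k$, it can be shown that the influence of
$\sigma^2_{\beta_k}$, $C_{t_k}$, and $P_k$ on the covariance matrix of the prediction can be considered
separately:

$$P^-_{k+1} = \frac{\partial \hat{x}^-_{k+1}}{\partial \beta_k}\, \sigma^2_{\beta_k} \left(\frac{\partial \hat{x}^-_{k+1}}{\partial \beta_k}\right)^T + R(\beta_k)\,(C_{t_k} + P_k)\,R^T(\beta_k).$$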
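As a minimal sketch of this covariance prediction, the following code evaluates the two terms separately: the rotation-angle term uses the derivative of R(beta) with respect to beta, and the second term rotates the sum of translation and point covariances. The rotation convention and all function names are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Product of two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def predict_covariance(x, t, beta, var_beta, C_t, P):
    """Covariance of R(beta) * (x - t) for independent errors in beta, t and x."""
    c, s = math.cos(beta), math.sin(beta)
    R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    dR = [[-s, 0.0, c], [0.0, 0.0, 0.0], [-c, 0.0, -s]]  # dR/dbeta
    d = [x[i] - t[i] for i in range(3)]
    # Jacobian column with respect to beta: dR/dbeta * (x - t)
    j = [sum(dR[row][col] * d[col] for col in range(3)) for row in range(3)]
    S = [[C_t[row][col] + P[row][col] for col in range(3)] for row in range(3)]
    rotated = matmul(matmul(R, S), transpose(R))
    return [[var_beta * j[row] * j[col] + rotated[row][col] for col in range(3)]
            for row in range(3)]
```

Because the Jacobians with respect to the point and the translation are R and -R, both covariances pass through the same rotation, which is why they can be added before rotating.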



In the correction step (see Fig. 1), the actual measurement $z_k$ is used to correct the
prediction $\hat{x}^-_k$. Here, the index $k+1$ is replaced by $k$ to simplify the notation and to
underline the iterative nature of the process. To get $z_k$ from the actual image pair, the
disparity estimation is performed using a stochastic method as described in [5]. The
advantage of this method is that, beyond the disparity itself, its reliability can be
estimated as the variance of disparity. From the disparity, the 3D-point $z_k$ is computed
using parameters of the calibrated stereo camera rig. The corresponding measurement covariance
matrix $C_{z_k}$ is computed by means of covariance propagation using the variances and
covariances of the camera parameters and the disparity variance, see [2]. Using the
actual measurement and its covariance matrix, the measurement update is computed
as follows [8]:

• compute Kalman gain: $K_k = P^-_k\,(P^-_k + C_{z_k})^{-1}$,

• correct the prediction $\hat{x}^-_k$ using measurement $z_k$: $\hat{x}_k = \hat{x}^-_k + K_k\,(z_k - \hat{x}^-_k)$, (2)

• update the covariance matrix of estimate: $P_k = (I - K_k)\,P^-_k$

($I$ denotes the identity matrix).
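The three update steps can be sketched for the special case of diagonal covariance matrices, where the update decouples into one scalar update per coordinate. This simplification and the function name are assumptions for illustration; the general case needs a 3x3 matrix inverse.

```python
def kalman_correct_diagonal(x_pred, var_pred, z, var_meas):
    """Measurement update with an identity measurement matrix and diagonal covariances.

    x_pred, z          : predicted point and measured point (3 values each)
    var_pred, var_meas : diagonal entries of the prediction and measurement covariances
    Returns the corrected point, its variances and the per-axis Kalman gains.
    """
    x_new, var_new, gains = [], [], []
    for xp, vp, zk, vm in zip(x_pred, var_pred, z, var_meas):
        gain = vp / (vp + vm)               # Kalman gain
        x_new.append(xp + gain * (zk - xp))  # corrected estimate
        var_new.append((1.0 - gain) * vp)    # updated variance
        gains.append(gain)
    return x_new, var_new, gains
```

Feeding the returned variances back in as the next prediction variances (a static point without motion uncertainty) gives a gain sequence of 1/2, 1/3, 1/4 for equal measurement variances, so both the variance and the gain shrink with every image pair.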

As already mentioned, each 3D-point, together with its covariance matrix, has to be
“tracked” over an image pair sequence separately from other points. The “tracking” is
achieved in the following way: Assume that for image pair $k$ at some position a
disparity was measured that results in the 3D-point $z_k$. After computing the point $\hat{x}_k$
with equations (2), the prediction $\hat{x}^-_{k+1}$ is made using the robot movement data,
equation (1). This predicted 3D-point is then projected into image pair $k+1$, which
results in a certain predicted disparity. The actual disparity is measured at the
projected position and the corresponding 3D-point is reconstructed. The next estimate
is then calculated depending on the uncertainty of the prediction and on the
uncertainty of the actual measurement.

3. Probabilistic Obstacle Detection 

The goal of obstacle detection for a mobile robot that moves on a plane is to detect
obstacles and to determine their position on the plane. For this purpose a grid is laid on
the xz-plane where the robot moves. The task of the obstacle detection procedure is to mark
those elements of the grid where an obstacle is detected. In the following, a probabilistic
method for marking obstacle areas is proposed.

The described Kalman filtering requires a Gaussian probability distribution of errors. In
this case, the probability distribution of a point $x = (x, y, z)^T$ is completely determined
by its estimated value $\hat{x} = (\hat{x}, \hat{y}, \hat{z})^T$ and
the covariance matrix $P$. The probability density function (PDF) is given by

$$f(x) = \frac{1}{(2\pi)^{3/2}\,|P|^{1/2}} \exp\left(-\frac{1}{2}\,(x - \hat{x})^T P^{-1} (x - \hat{x})\right).$$

Normally, an obstacle edge consists of many 3D points that are found over a
comparatively small area on the plane. Each point is detected with its own uncertainty in
the image data. For some points this results in PDFs that are broadly distributed over
many grid elements, and for other points in PDFs that are distributed over only a few
grid elements. The problem is to combine these different PDFs in an appropriate way
to make the marking of obstacle areas possible.

In a first step, only one point, tracked as described above, and the corresponding
PDF will be considered (Fig. 3). The PDF with the expected value $\hat{x}$ and the
covariance matrix $P$ is shown as an ellipsoid. The projection of this PDF on the xz-plane
is represented by ellipses on this plane and gives an idea where the obstacle area
should lie. The projection is done by integration of the PDF $f$:

$$p(x_i, z_j) = \int_{x_i}^{x_{i+1}} \int_{-\infty}^{\infty} \int_{z_j}^{z_{j+1}} f(x, y, z)\, dz\, dy\, dx.$$

Here, $p(x_i, z_j)$ denotes the probability that the estimated point $\hat{x}$ lies over the grid
element between $[x_i, x_{i+1}]$ and $[z_j, z_{j+1}]$. Using a threshold, the influence of the
points that could be estimated only with small certainty is reduced.

If there are altogether $N$ points lying over the grid element, the probabilities
$p_n(x_i, z_j)$, $n = 1, \ldots, N$, of all points are accumulated. With this method, both the number
of points over a grid element and their precision are taken into account.
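The projection and accumulation can be sketched under simplifying assumptions: the covariance is taken as diagonal in x and z, so the cell probability factorizes into two one-dimensional Gaussian integrals, and the threshold (called `p_min` here, a hypothetical name) is applied per point and per cell before accumulation. How the threshold enters in the original method is not recoverable from the text, so this is only one plausible reading.

```python
import math


def normal_cdf(value, mean, sigma):
    """Cumulative distribution function of a 1D Gaussian."""
    return 0.5 * (1.0 + math.erf((value - mean) / (sigma * math.sqrt(2.0))))


def cell_probability(mean_x, mean_z, sigma_x, sigma_z, x0, x1, z0, z1):
    """Probability mass of an axis-aligned Gaussian inside the cell [x0, x1] x [z0, z1]."""
    px = normal_cdf(x1, mean_x, sigma_x) - normal_cdf(x0, mean_x, sigma_x)
    pz = normal_cdf(z1, mean_z, sigma_z) - normal_cdf(z0, mean_z, sigma_z)
    return px * pz


def accumulate(points, x_edges, z_edges, p_min=0.0):
    """Sum the cell probabilities of all points; contributions below p_min are dropped.

    points: iterable of (mean_x, mean_z, sigma_x, sigma_z)
    """
    grid = [[0.0] * (len(z_edges) - 1) for _ in range(len(x_edges) - 1)]
    for mean_x, mean_z, sigma_x, sigma_z in points:
        for i in range(len(x_edges) - 1):
            for j in range(len(z_edges) - 1):
                p = cell_probability(mean_x, mean_z, sigma_x, sigma_z,
                                     x_edges[i], x_edges[i + 1],
                                     z_edges[j], z_edges[j + 1])
                if p >= p_min:
                    grid[i][j] += p
    return grid
```

Two precisely located points in the same cell add up to a value near 2, whereas a single very uncertain point spreads thinly over many cells and is removed by the threshold; this is how both the number and the precision of the points enter the result.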



4. Experimental Results 

The presented algorithm has been implemented on a standard PC using the C
programming language. The experiments were carried out with a mobile robot from
the company “Götting KG”, which is equipped with odometry sensors
mounted on the driving shaft. From the odometric data the robot motion is computed
and transmitted to the PC. To determine the uncertainty of the robot motion, the robot
was moved and its actual position and orientation were compared with the odometric
data. The uncertainties of translation were measured as 5% perpendicular to the direction
of robot motion and 2% parallel to it; the uncertainty of rotation was 0.05 radian.

As the stereoscopic video sensor, two synchronised monochrome CCD cameras have
been used. They are mounted 60 cm above the floor plane at a distance of about
12 cm from each other. The cameras were calibrated using the Tsai method [7], with a
precision of 3D-reconstruction of 3-5 % up to a distance of 5 m. The
video data were processed as described in [6] in order to perform the
3D-reconstruction of the scene.

To investigate the results of 3D-point tracking by means of Kalman filtering, the
video sensor has been applied in a static environment. In each image pair at the same
image position, the disparities were estimated with slightly different disparity
variances due to image noise. Using the variances and covariances of camera
parameters and the disparity variance, the covariance matrices of the corresponding
3D-points are computed ($k$ denotes the image number). Due to the relatively short
distance between the cameras, the resulting uncertainty in the z-direction, $\sigma_{zk}$, is much
bigger than the uncertainties in x- and y-direction, $\sigma_{xk}$ and $\sigma_{yk}$. To simplify the
interpretation of the results, it was assumed that the values of $\sigma_{xk}$ and $\sigma_{yk}$
are equal to the value of $\sigma_{zk}$ (worst case) and non-diagonal elements are set to 0. This also
results in a diagonal shape of $K_k$ and $P_k$ after performing the Kalman filtering, formulas (2), so
that the PDFs of the corresponding 3D-points are radially symmetric.

Fig. 4 shows the computed variances (diagonal elements of matrices $P_k$) and the
Kalman gain from 10 image pairs. Because the 3D-point was not disturbed by the robot motion,
the 3D-variance as well as the Kalman gain decreases continuously, as expected.

Fig. 4. Results of Kalman filtering

Fig. 5. Results of probabilistic obstacle detection

Fig. 5 a) shows the left image of an image pair. The results of the projection of
PDFs of single 3D-points with subsequent accumulation as described in chapter 3 are
represented in Fig. 5 b). The figure shows the “bird’s-eye view” of the scene in front
of the robot, which is placed at the position denoted by the coordinate cross. The grey
scale value of a point in the picture represents the logarithm of the probability that
there is an obstacle. Dark pixels represent a higher probability than light
ones. Fig. 5 c) shows the results of the 3D-reconstruction without any probabilistic
data interpretation. Comparing Fig. 5 b) and 5 c), it can be seen that most outliers
could be successfully suppressed (some outliers are marked by ellipses in Fig. 5 c)).
The final step of obstacle detection has been made by marking those points whose
probability exceeds a certain threshold (denoted by rectangles in Fig. 5 b)). For
testing purposes the obstacles were projected back into the left image (denoted by
ellipses in Fig. 5 a)).



5. Summary 

Mobile robot applications require robust and reliable 3D-reconstruction and obstacle
detection. In the presented paper, the Kalman filter technique was used to robustly
combine measurements from different image pairs. The experimental results show the
ability of the described method to fuse robot motion data with the results
of stereoscopic 3D-reconstruction.

Furthermore, a stochastically motivated method was developed to carry out the
obstacle detection. It could be shown experimentally that this method can be used for
outlier elimination in reconstructed 3D-data and in this way for the suppression of
phantom objects.

Further developments will concentrate on course generation to allow the robot to
move around obstacles. Another future task is to develop a landmark-based global
localisation technique for mobile robots using odometry and video data. A promising
approach for this is to apply the Markov localisation procedure used in [3].

Abstract. We consider 3D object retrieval in which a polygonal mesh
serves as a query and similar objects are retrieved from a collection of 3D
objects. Algorithms proceed first by a normalization step in which models
are transformed into canonical coordinates. Second, feature vectors
are extracted and compared with those derived from normalized models
in the search space. In the feature vector space nearest neighbors are
computed and ranked. Retrieved objects are displayed for inspection, selection,
and processing. Our feature vectors are based on rays cast from
the center of mass of the object. For each ray the object extent in the
ray direction yields a sample of a function on the sphere. We compared
two kinds of representations of this function, namely spherical harmonics
and moments. Our empirical comparison using precision-recall diagrams
for retrieval results in a database of 3D models showed that the method
using spherical harmonics performed better.



1 Introduction 

Currently methods for retrieving multimedia documents using audio-visual content
as a key in place of traditional textual annotation are developed in MPEG-7 [?].
Many similarity-based retrieval systems were designed for still image, audio
and video, while only a few techniques for content-based 3D model retrieval
have been reported [?]. We consider 3D object retrieval in which
a 3D model given as a triangle mesh serves as a query key and similar objects
are retrieved from a collection of 3D objects. Content-based 3D model retrieval
algorithms typically proceed in three steps:

1. Normalization (pose estimation). 3D models are given in arbitrary units of
measurement and undefined positions and orientations. The normalization
step transforms a model into a canonical coordinate frame. The goal of this
procedure is that if one chose a different scale, position, rotation, or orientation
of a model, then the representation in canonical coordinates would
still be the same. Moreover, since objects may have different levels-of-detail
(e.g., after a mesh simplification to reduce the number of polygons), their
normalized representations should be as similar as possible.

2. Feature extraction. The features capture the 3D shape of the objects. Proposed
features range from simple bounding box parameters [?] to complex
image-based representations [?]. The features are stored as vectors of fixed
dimension. There is a tradeoff between the required storage, computational
complexity, and the resulting retrieval performance.

3. Similarity search. The features are designed so that similar 3D-objects are
close in feature vector space. Using a suitable metric nearest neighbors are
computed and ranked. A variable number of objects are thus retrieved by
listing the top ranking items.

We present an empirical study extending our contribution [?] in which we introduced
a modification of the Karhunen-Loeve transform and the application of
spherical harmonics to the problem of 3D object retrieval. We first review the 3D
model retrieval problem and previous work. Then we recall our approach based
on spherical harmonics and present an alternative using moments. We describe
the experiments that we designed to evaluate and contrast the two competing
methods. Finally, the results and conclusions are presented.

2 Previous Work 

The normalization step is much simpler than the pose estimation deeply studied
in computer vision, where a 3D pose must be inferred from one or more
images, i.e., projections of a 3D object. Here, the 3D models for the retrieval
problem are already given in 3D space, and, thus, the most prominent method
for normalization is the principal component analysis (PCA), also known as the
Karhunen-Loeve transform. It is an affine transformation based on a set of vectors,
e.g., the set of vertices of a 3D model. After a translation of the set moving
its center of mass to the origin, a rotation is applied so that the largest variance
of the transformed points is along the x-axis. Then a rotation around the
x-axis is carried out so that the maximal spread in the yz-plane occurs along
the y-axis. Finally, the object is scaled to a certain unit size. A problem is that
differing sizes of triangles are not taken into account, which may cause widely
varying normalized coordinate frames for models that are identical except for
finer triangle resolution in some parts of the model. To address this issue we
introduced appropriately chosen vertex weights for the PCA [?], while Paquet
et al. [?] used centers of mass of triangles as vectors for the PCA with weights
proportional to triangle areas. Later we generalized the PCA so that all of the
(infinitely many) points in the polygons of an object are equally relevant for the
transformation [?].

Feature vectors for 3D model retrieval can be based on Fourier descriptors of
silhouettes [?], on 3D moments [?], rendered images or depth maps [?], or on a
volumetric representation of the model surface [?] or the corresponding volume
(if the surface bounds a solid) [?].

Using special moments of 3D objects the normalization step may be joined
with the feature extraction. In [?] a complete set of orthogonal 3D Zernike polynomials
provides spherical moments with advantages regarding noise effects and
with less information suppression at low radii. The normalization is done using
3D moments of degree not greater than 3. There were no examples demonstrating
the performance of these feature vectors in 3D model retrieval.

In this paper we consider a particular method to generate feature vectors for
3D object retrieval. In a first step the 3D shape is characterized by a function on
the sphere. For this function we empirically compare two kinds of representation,
one using spherical harmonics and the other by computing moments.

Fig. 1. Multi-resolution representation of the function $r(u) = \max\{r \ge 0 \mid ru \in I \cup \{0\}\}$
used to derive feature vectors from Fourier coefficients for spherical harmonics.
Panels: original, $8^2$ harmonics, $16^2$ harmonics, $24^2$ harmonics.



3 Functions on the Sphere for 3D Shape Feature Vectors 

In this section we describe the feature vectors used in our comparative study. As
3D models we take triangle meshes consisting of triangles $\{T_1, \ldots, T_m\}$, $T_i \subset \mathbb{R}^3$,
given by vertices (geometry) $\{p_1, \ldots, p_n\}$, $p_i = (x_i, y_i, z_i) \in \mathbb{R}^3$, and an index
table with three vertices per triangle (topology). Then our object is $I = \bigcup_{i=1}^{m} T_i$,
the point set of all triangles. We may assume that our models are normalized by
a modified PCA as outlined in Section 2. For details we refer to [?].

Some feature vectors can be considered as samples of a function on the sphere
$S^2$. For example, for a (normalized) model $I$ define

$$r\colon u \mapsto \max\{r \ge 0 \mid ru \in I \cup \{0\}\}$$

where 0 is the origin. This function $r(u)$ measures the extent of the object in
directions given by $u \in S^2$. In [?] we took a number of samples $r(u)$ as a
feature vector, which, however, is sensitive to small perturbations of the model.
In this paper we improve the robustness of the feature vector by sampling the
spherical function $r(u)$ at many points but characterizing the map by just a few
parameters, using either spherical harmonics or moments. Other definitions of
features as functions on the sphere are possible. For example, one may consider
a rendered perspective projection of the object on an enclosing sphere, see [?].

The Fourier transform on the sphere uses the spherical harmonic functions
$Y_l^m$ to represent any spherical function $r \in L^2(S^2)$ as
$r = \sum_{l \ge 0} \sum_{|m| \le l} \hat{r}(l, m)\, Y_l^m$. Here $\hat{r}(l, m)$ denotes a Fourier coefficient and the spherical
harmonic basis functions are certain products of Legendre functions and
complex exponentials. The (complex) Fourier coefficients can be efficiently
computed by a spherical FFT algorithm applied to samples taken at points
$u_{ij} = (x_{ij}, y_{ij}, z_{ij}) = (\cos\varphi_i \sin\theta_j, \sin\varphi_i \sin\theta_j, \cos\theta_j)$, where $\varphi_i = 2i\pi/n$,
$\theta_j = (2j + 1)\pi/2n$, $i, j = 0, \ldots, n - 1$, and $n$ is chosen sufficiently large. We
cannot give more details here and refer to the survey and software in [?]. One
may use the spherical harmonic coefficients to reconstruct an approximation of
the underlying object at different levels, see Figure 1. An example output of the
absolute values of the spherical Fourier coefficients (up to $l = 3$) is given here:

1.161329 

0.063596 0.162562 0.063596 

0.213232 0.037139 0.373217 0.037139 0.213232 

0.016578 0.008051 0.009936 0.008301 0.009936 0.008051 0.016578

Fig. 2. Precision vs. recall results for varying dimensions in spherical harmonics (left)
and moments (right). Results were averaged over all retrieval results in the class of
airplane models.

Feature vectors can be extracted from the first $l + 1$ rows of coefficients. This
implies that such a feature vector contains all feature vectors of the same type
of smaller dimension, thereby providing an embedded multi-resolution approach
for 3D shape feature vectors. We have chosen to use only the absolute values
as components of our feature vectors. Because of the symmetry in the rows of
the coefficients (for real functions on the sphere coefficients in rows are pairwise
complex conjugate) we therefore obtain feature vectors of dimension
$(l + 1)(l + 2)/2$ for $l = 0, 1, 2, \ldots$, i.e., 1, 3, 6, 10, 15, and so forth.

An alternative to the representation of a spherical function by spherical harmonics
is given by moments. To be consistent we sample the spherical function
$r(u)$ at the same points $u_{ij}$, $i, j = 0, \ldots, n - 1$, as for the representation by
spherical harmonics. As moments we define

$$M^{q,r,s} = \sum_{i,j=0}^{n-1} r(u_{ij})\, \Delta s_{ij}\, x_{ij}^q\, y_{ij}^r\, z_{ij}^s$$

for $q, r, s = 0, 1, 2, \ldots$ The factor $\Delta s_{ij}$ represents the surface area on the sphere
corresponding to the sample point $u_{ij} = (\cos\varphi_i \sin\theta_j, \sin\varphi_i \sin\theta_j, \cos\theta_j)$ and
compensates for the nonuniform sampling. For example, when $n = 128$ we have
$\Delta s_{ij} = \frac{\pi}{64}\left(\cos(\theta_j - \frac{\pi}{256}) - \cos(\theta_j + \frac{\pi}{256})\right)$. For the feature vector we ignore $M^{0,0,0}$
and use $1 \le q + r + s \le m$. As $m$ grows from 2 to 6 the dimension of the corresponding
feature vectors increases from 9 to 19, 34, 55, and 83 (the dimension
is $(m + 1)(m + 2)(m + 3)/6 - 1$).
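The moment feature vector can be sketched as follows. The sample grid and the area weights follow the formulas above; the callback `extent`, which is assumed to return $r(u)$ for a unit direction $u$ (for a real mesh this would require ray-triangle intersection), and all function names are hypothetical.

```python
import math


def sample_directions(n):
    """Return (u_ij, area_ij) pairs for the n x n grid of sample points on the unit sphere."""
    half_step = math.pi / (2 * n)  # half of the polar step pi/n
    samples = []
    for i in range(n):
        phi = 2.0 * math.pi * i / n
        for j in range(n):
            theta = (2 * j + 1) * math.pi / (2 * n)
            u = (math.cos(phi) * math.sin(theta),
                 math.sin(phi) * math.sin(theta),
                 math.cos(theta))
            # area of the spherical cell around u; equals the paper's formula for n = 128
            area = (2.0 * math.pi / n) * (math.cos(theta - half_step)
                                          - math.cos(theta + half_step))
            samples.append((u, area))
    return samples


def exponents(m):
    """All exponent triples (q, r, s) with 1 <= q + r + s <= m."""
    return [(q, r, s)
            for q in range(m + 1)
            for r in range(m + 1 - q)
            for s in range(m + 1 - q - r)
            if q + r + s >= 1]


def moment_features(extent, n, m):
    """Moments of the sampled spherical function, one per exponent triple."""
    weighted = [(u, extent(u) * area) for u, area in sample_directions(n)]
    return [sum(w * x ** q * y ** r * z ** s for (x, y, z), w in weighted)
            for q, r, s in exponents(m)]
```

Counting the exponent triples reproduces the dimensions 9, 19, 34, 55, and 83 for m = 2, ..., 6, and the area weights sum to the surface area of the unit sphere.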

4 Results and Conclusion 

For our tests we collected a database of 1829 models which we manually classified.
For example, we obtained 52 models of cars, 68 airplanes, 26 bottles, and
28 swords. On average a model contains 5667 vertices and 10505 triangles. We
used $n^2 = 128^2 = 16384$ samples $r(u_{ij})$ of the spherical function for the computation
of the spherical harmonics and the moments. For the nearest neighbor
computation in feature vector space we used the $l_1$-distance.

Fig. 3. Precision vs. recall results for four classes (airplanes, cars, bottles, and swords)
using three methods, the ray-based feature vectors with spherical harmonics (Rays-SH),
with moments (Ray-moments), and a method based on statistical moments [?]
(Moments). The dimensions of the feature vectors are shown in the legends in brackets.
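Nearest neighbor ranking with the $l_1$-distance can be sketched as below. The dictionary layout of the database and the function names are assumptions for illustration.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences of two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_l1(query, database):
    """Return model names sorted by increasing l1-distance to the query vector.

    database: dict mapping a model name to its feature vector
    """
    return sorted(database, key=lambda name: l1_distance(query, database[name]))
```

Listing the first k names of the returned ranking gives the k top ranking items for the query.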

The retrieval performance can be expressed in so-called precision-recall diagrams.
Briefly, precision is the proportion of retrieved models that are relevant
(i.e., in the correct class) and recall is the proportion of the relevant models
actually retrieved. By increasing the number of nearest neighbors in the retrieval
the recall value increases while the precision typically decreases. By examining
the precision-recall diagrams for different queries (and classes) we obtained a
measure of the retrieval performance. For our tests we selected one class of objects
(e.g., cars) and used each of the objects in the class as a query model.
The precision-recall values for these experiments were averaged and yielded one
curve in the corresponding diagram.
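The precision and recall values for a single query can be sketched as follows. The function name and the representation of a retrieval result as a ranked list of class labels are assumptions; the averaging over all queries of a class is left out.

```python
def precision_recall(ranked_labels, target, class_size):
    """Return (recall, precision) pairs after each retrieved item.

    ranked_labels: class labels of the retrieved models, best match first
    target       : class label of the query model
    class_size   : number of relevant models in the database
    """
    points = []
    hits = 0
    for k, label in enumerate(ranked_labels, start=1):
        if label == target:
            hits += 1
        points.append((hits / class_size, hits / k))
    return points
```

For the ranking car, plane, car with two relevant models, recall rises from 0.5 to 1.0 while precision drops from 1.0 to 2/3, which is the typical trade-off described above.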

In our first test series we studied the dependency of the retrieval performance
on the dimensionality of the feature vectors. The class of objects was given by
the 68 airplanes. We conclude that both types of ray-based feature vectors yield
a better performance when the dimension is increased.

In our second test series we compared the performance of the feature vectors
for retrieving 3D models in four classes. In all cases the representation of the
ray-based feature vector using spherical harmonics performed best, see Figure 3.
The graphs also include results for feature vectors based on statistical moments
from [?], defined as $M^{q,r,s} = \sum_i S_i\, x_i^q\, y_i^r\, z_i^s$, where the point $(x_i, y_i, z_i)$ is
the centroid of the $i$-th triangle and $S_i$ is the area of that triangle. Due to the
normalization the moments with $q + r + s \le 1$ are zero and can be omitted in
the moment feature vectors. The retrieval performance was tested for several
dimensions, and we found that the performance decreased for dimensions larger
than 31. As shown in Figure 3 the retrieval performance of these feature vectors
was inferior to that produced by the ray-based feature vectors.

To conclude, the Fast Fourier Transform on the sphere
with spherical harmonics provides a natural approach for generating embedded
multi-resolution 3D shape feature vectors. In tests using a ray-based feature
vector the representation with spherical harmonics performed better than a
representation using moments.

Abstract. Curve and surface fitting is a relevant subject in computer
vision and coordinate metrology. In this paper, we present a new fitting
algorithm for implicit surfaces and plane curves which minimizes
the square sum of the orthogonal error distances between the model feature
and the given data points. By the new algorithm, the model feature
parameters are grouped and simultaneously estimated in terms of form,
position, and rotation parameters. The form parameters determine the
shape of the model feature, and the position/rotation parameters describe
the rigid body motion of the model feature. The proposed algorithm
is applicable to any kind of implicit surface and plane curve.



1 Introduction 



Fitting a curve or surface to a set of given data points is a very common task
in applications of image processing and pattern recognition, e.g. edge
detection and information extraction from 2D images or 3D range images. In this
paper, we consider least squares fitting algorithms for implicit model features.
Algebraic fitting is a procedure whereby the model feature is described by the
implicit equation $F(a, X) = 0$ with parameters $a = (a_1, \ldots, a_q)^T$, and the error
distances are defined as the deviations of functional values from the expected
value (i.e. zero) at each given point. If $F(a, X_i) \ne 0$, the given point $X_i$ does
not lie on the model feature (i.e. there is some error-of-fit). Most publications
about LS-fitting of implicit features have been concerned with the square sum
of algebraic distances or their modifications



$$\sigma_0^2 = \sum_{i} F^2(a, X_i) \quad \text{or} \quad \sigma_0^2 = \sum_{i} \left[ F(a, X_i) / \|\nabla F(a, X_i)\| \right]^2 .$$
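To illustrate how these error measures differ, the following sketch evaluates them for a circle centred at the origin, $F = x^2 + y^2 - r^2$, and compares them with the square sum of the true orthogonal distances. The circle is chosen only as a simple example, and the function names are assumptions.

```python
import math


def circle_value(radius, point):
    """Implicit function F = x^2 + y^2 - radius^2 of a circle centred at the origin."""
    x, y = point
    return x * x + y * y - radius * radius


def algebraic_sum(radius, points):
    """Square sum of the algebraic distances F(a, X_i)."""
    return sum(circle_value(radius, p) ** 2 for p in points)


def normalized_algebraic_sum(radius, points):
    """Square sum of the algebraic distances divided by the gradient norm."""
    total = 0.0
    for x, y in points:
        grad_norm = math.hypot(2.0 * x, 2.0 * y)  # gradient of F is (2x, 2y)
        total += (circle_value(radius, (x, y)) / grad_norm) ** 2
    return total


def geometric_sum(radius, points):
    """Square sum of the orthogonal distances to the circle."""
    return sum((math.hypot(x, y) - radius) ** 2 for x, y in points)
```

For the unit circle and the points (2, 0) and (0, 0.5), the three measures give 9.5625, 1.125, and 1.25; only the last one is the sum of squared shortest distances, and the algebraic measure weights the outer point far more heavily than the inner one.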



In spite of its advantages in implementation and computing cost, algebraic
fitting has drawbacks in accuracy and is not invariant to coordinate
transformation. In geometric fitting, also known as best fitting or orthogonal
distance fitting, the error distance is defined as the shortest distance
(geometric distance) of a given point to the model feature. Sullivan et al. [?]
have presented a geometric
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Fig. 1. Implicit surface, and the orthogonal contacting point x' in frame xyz from the 
given point Xi in frame XYZ. 



fitting algorithm for implicit algebraic features

$$F(a, X) = \sum_{j}^{q} a_j X^{k_j} Y^{l_j} Z^{m_j} = 0 ,$$

minimizing the square sum of the geometric distances
$d(a, X_i) = \|X_i - X_i'\|$, where $X_i$ is a given point and $X_i'$ is the
nearest point on the model feature $F$ from $X_i$. A weakness of Sullivan's
algorithm, as in the case of algebraic fitting, is that the physical parameters
are combined into a single algebraic parameter vector.

We will now introduce a universal and very efficient orthogonal distance
fitting algorithm for implicit surfaces and plane curves. Our algorithm is a
generalized extension of an orthogonal distance fitting algorithm for implicit
plane curves [?]. The new algorithm consists of two nested iterations. The
inner iteration finds the orthogonal contacting point on the model feature from
the given point (Section 2), and the outer iteration updates the estimation
parameters (Section 3). In the new algorithm, the estimation parameters $a$ are
grouped into three categories and simultaneously estimated.

First, the form parameters $a_g$ describe the shape of the standard model
feature $f$ defined in the model coordinate system xyz (Fig. 1)

$$f(a_g, x) = 0 \quad\text{with}\quad a_g = (a_1, \ldots, a_l)^T . \tag{1}$$

The form parameters are invariant to the rigid body motion of the model
feature. The second and the third parameter groups (the position parameters
$a_p$ and the rotation parameters $a_r$) describe the rigid body motion of the
model feature $f$ in the machine coordinate system XYZ

$$X = R^{-1} x + X_o \quad\text{or}\quad x = R(X - X_o) , \quad\text{where} \tag{2}$$

$$R = R_\kappa R_\varphi R_\omega , \quad R^{-1} = R^T , \quad a_p = X_o = (X_o, Y_o, Z_o)^T , \quad\text{and}\quad a_r = (\omega, \varphi, \kappa)^T .$$

In this paper, we intend to simultaneously estimate all these parameters

$$a^T = (a_g^T, a_p^T, a_r^T) = (a_1, \ldots, a_l, X_o, Y_o, Z_o, \omega, \varphi, \kappa) = (a_1, \ldots, a_q) .$$

For plane curve fitting, we simply ignore all terms concerning $z$, $Z$,
$\omega$, and $\varphi$.
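The coordinate transformation between the machine frame XYZ and the model frame xyz can be sketched as follows. The elementary rotation matrices and their sign conventions are an assumption made for this illustration (the text only states that the rotation is composed of three elementary rotations and is orthogonal); the numerical values are the superquadric position and rotation parameters reported in Table 2 and the first point of Table 1.

```python
import math

# Elementary rotations; the sign convention is an assumption for this sketch.
def rot_x(w):
    c, s = math.cos(w), math.sin(w)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def rot_y(p):
    c, s = math.cos(p), math.sin(p)
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]

def rot_z(k):
    c, s = math.cos(k), math.sin(k)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def apply(A, v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

def rotation(omega, phi, kappa):
    """R as a product of rotations about z (kappa), y (phi) and x (omega)."""
    return matmul(rot_z(kappa), matmul(rot_y(phi), rot_x(omega)))

def to_model(X, Xo, R):
    """x = R (X - Xo): machine frame XYZ to model frame xyz."""
    return apply(R, [X[i] - Xo[i] for i in range(3)])

def to_machine(x, Xo, R):
    """X = R^T x + Xo: model frame xyz back to machine frame XYZ."""
    v = apply(transpose(R), x)
    return [v[i] + Xo[i] for i in range(3)]

Xo = [1.9096, -1.0234, 2.0191]           # position parameters (Table 2)
R = rotation(0.6962, -0.6952, 0.6960)    # rotation parameters (Table 2)
X = [-4.0, 3.0, 13.0]                    # first point of Table 1
x = to_model(X, Xo, R)
X_back = to_machine(x, Xo, R)
print(x, X_back)
```

Because the rotation matrix is orthogonal, the inverse transformation only needs the transpose, which is what the round trip above relies on.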
Fig. 2. Iterative search for the orthogonal contacting point $x_i'$ on
$f(a_g, x) = 0$ from the given point $x_i$. The points $x_{i,1}$ and $x_{i,2}$
are the first and the second approximation of $x_i'$, respectively.

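2 Orthogonal Contacting Point

For each given point $x_i = R(X_i - X_o)$ in frame xyz, we determine the
orthogonal contacting point $x_i'$ on the standard model feature (1). Then, the
orthogonal contacting point $X_i'$ in frame XYZ for the given point $X_i$ is
obtained through a backward transformation of $x_i'$ into XYZ. From the fact
that the line connecting $x_i$ with $x_i'$ on $f(a_g, x) = 0$ is parallel to
the surface normal $\nabla f$ at $x_i'$ (Fig. 1), we build the orthogonal
contacting equation (3) defined in the model coordinate system xyz, and
directly solve it for $x$ by using a generalized Newton method starting from
the initial point $x_0 = x_i$ (Fig. 2). (How to derive
$\partial \mathbf{f} / \partial x$ is shown in Section 3.)

$$\mathbf{f}(a_g, x_i, x) = \begin{pmatrix} f \\ \nabla f \times (x_i - x) \end{pmatrix} = 0 , \qquad \frac{\partial \mathbf{f}}{\partial x}\bigg|_k \Delta x = -\mathbf{f}(x)\big|_k , \quad x_{k+1} = x_k + \alpha \Delta x . \tag{3}$$

Alternatively, we can find $x_i'$ by minimizing the Lagrangian function [?] as
follows:

$$L(\lambda, x) = (x_i - x)^T (x_i - x) + \lambda f ,$$

$$\begin{pmatrix} 2I + \lambda H & \nabla f \\ \nabla^T f & 0 \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta \lambda \end{pmatrix} = \begin{pmatrix} 2(x_i - x) - \lambda \nabla f \\ -f \end{pmatrix} , \quad\text{where}\quad H = \frac{\partial}{\partial x} \nabla f . \tag{4}$$

The iteration with (4) converges relatively fast. Nevertheless, if it fails to
converge, we catch it using the direct method (3).

3 Orthogonal Distance Fitting

The goal of the orthogonal distance fitting of a model feature to a set of
given data points is the estimation of the feature parameters $a$ minimizing
the performance index

$$\sigma_0^2 = (X - X')^T P^T P (X - X') , \tag{5}$$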
where $X$ and $X'$ are coordinate column vectors ($X^T = (X_1^T, \ldots, X_m^T)$), and
$P^T P$ is the weighting matrix or the error covariance matrix. The nearest
point $X_i'$ on the model feature from the given point $X_i$ is assumed to have
been found by the algorithms described in Section 2 in a nested iteration
scheme (inner iteration). In this section, we describe the update of the
parameters $a$ (outer iteration).
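The inner iteration referred to above can be illustrated for a plane curve. The sketch below is not the authors' implementation: it assumes an axis-aligned ellipse $f = (x/a)^2 + (y/b)^2 - 1$ as the model feature and applies an undamped Newton iteration to the two conditions "the point lies on the curve" and "the connecting line is parallel to the curve normal", starting from the given point itself. The function name is hypothetical.

```python
def closest_point_on_ellipse(a, b, xi, yi, tol=1e-12, max_iter=50):
    """Newton iteration for the orthogonal contacting point on an ellipse."""
    x, y = xi, yi                                    # start at the given point
    for _ in range(max_iter):
        fx, fy = 2.0 * x / a ** 2, 2.0 * y / b ** 2  # gradient of f
        hxx, hyy = 2.0 / a ** 2, 2.0 / b ** 2        # Hessian of f (diagonal here)
        dx, dy = xi - x, yi - y                      # error vector x_i - x
        f = (x / a) ** 2 + (y / b) ** 2 - 1.0        # "on the curve" condition
        g = fx * dy - fy * dx                        # "parallel to the normal" condition
        # Jacobian of (f, g) with respect to (x, y)
        j11, j12 = fx, fy
        j21 = hxx * dy + fy
        j22 = -hyy * dx - fx
        det = j11 * j22 - j12 * j21
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian, e.g. at the ellipse center")
        # Solve J * step = -(f, g) with Cramer's rule
        sx = (-f * j22 + g * j12) / det
        sy = (-g * j11 + f * j21) / det
        x, y = x + sx, y + sy
        if abs(sx) + abs(sy) < tol:
            break
    return x, y

cx, cy = closest_point_on_ellipse(1.0, 1.0, 3.0, 4.0)   # unit circle
ex, ey = closest_point_on_ellipse(2.0, 1.0, 3.0, 0.0)   # point on the major axis
print((cx, cy), (ex, ey))
```

For the unit circle the result is the radial projection of the given point, and for the second call it is the vertex of the ellipse. A production version would add step-size control for starting points where the plain Newton step does not converge.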

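The first order necessary condition for the minimum of the performance index
(5) as a function of the parameters $a$ is

$$\left( \frac{\partial \sigma_0^2}{\partial a} \right)^T = 0 = -2 J^T P^T P (X - X') , \quad\text{where}\quad J = \frac{\partial X'}{\partial a} . \tag{6}$$

In this paper, we iteratively solve (6) for $a$ by using the Gauss-Newton method

$$P J \Delta a = P (X - X') , \qquad a_{k+1} = a_k + \alpha \Delta a , \tag{7}$$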



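with the Jacobian matrices of each orthogonal contacting point $X_i'$, directly
derived from (2),

$$J_{X_i', a} = \frac{\partial X}{\partial a}\bigg|_{X = X_i'} = \left( R^{-1} \frac{\partial x}{\partial a} + \frac{\partial R^{-1}}{\partial a} x + \frac{\partial X_o}{\partial a} \right)\bigg|_{x = x_i'} = R^{-1} \frac{\partial x}{\partial a}\bigg|_{x = x_i'} + \left( 0 \;\Big|\; I \;\Big|\; \frac{\partial R^{-1}}{\partial a_r} [x_i'] \right) . \tag{8}$$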



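The derivative matrix $\partial x / \partial a$ at $x = x_i'$ in (8) describes
the variational behavior of the orthogonal contacting point $x_i'$ in frame xyz
relative to the differential changes of the parameter vector $a$. We obtain
$\partial x / \partial a$ from the orthogonal contacting equation (3). Because
(3) has an implicit form, its derivatives lead to

$$\frac{\partial \mathbf{f}}{\partial x} \frac{\partial x}{\partial a} + \frac{\partial \mathbf{f}}{\partial x_i} \frac{\partial x_i}{\partial a} + \frac{\partial \mathbf{f}}{\partial a} = 0 \quad\text{or}\quad \frac{\partial \mathbf{f}}{\partial x} \frac{\partial x}{\partial a} = -\left( \frac{\partial \mathbf{f}}{\partial x_i} \frac{\partial x_i}{\partial a} + \frac{\partial \mathbf{f}}{\partial a} \right) , \tag{9}$$

where $\partial x_i / \partial a$ is, from $x_i = R(X_i - X_o)$,

$$\frac{\partial x_i}{\partial a} = \frac{\partial R}{\partial a} (X_i - X_o) - R \frac{\partial X_o}{\partial a} = \left( 0 \;\Big|\; -R \;\Big|\; \frac{\partial R}{\partial a_r} [X_i - X_o] \right) .$$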

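The other three matrices $\partial \mathbf{f} / \partial x$,
$\partial \mathbf{f} / \partial x_i$, and $\partial \mathbf{f} / \partial a$ in
(9) are to be directly derived from (3). The elements of these three matrices
are composed of simple linear combinations of components of the error vector
$(x_i - x)$ with elements of the following three vector/matrices $\nabla f$,
$H$, and $G$ (FHG matrix):

$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right)^T , \qquad H = \frac{\partial}{\partial x} \nabla f , \qquad G = \frac{\partial}{\partial a_g} \begin{pmatrix} f \\ \nabla f \end{pmatrix} ,$$

$$\frac{\partial \mathbf{f}}{\partial x} = \begin{pmatrix} 0 & 0 & 0 \\ y_i - y & -(x_i - x) & 0 \\ -(z_i - z) & 0 & x_i - x \\ 0 & z_i - z & -(y_i - y) \end{pmatrix} H + \begin{pmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} & \frac{\partial f}{\partial z} \\ \frac{\partial f}{\partial y} & -\frac{\partial f}{\partial x} & 0 \\ -\frac{\partial f}{\partial z} & 0 & \frac{\partial f}{\partial x} \\ 0 & \frac{\partial f}{\partial z} & -\frac{\partial f}{\partial y} \end{pmatrix} ,$$

$$\frac{\partial \mathbf{f}}{\partial x_i} = \begin{pmatrix} 0 & 0 & 0 \\ -\frac{\partial f}{\partial y} & \frac{\partial f}{\partial x} & 0 \\ \frac{\partial f}{\partial z} & 0 & -\frac{\partial f}{\partial x} \\ 0 & -\frac{\partial f}{\partial z} & \frac{\partial f}{\partial y} \end{pmatrix} ,$$

$$\frac{\partial \mathbf{f}}{\partial a} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & y_i - y & -(x_i - x) & 0 \\ 0 & -(z_i - z) & 0 & x_i - x \\ 0 & 0 & z_i - z & -(y_i - y) \end{pmatrix} \left( G \;\big|\; 0 \;\big|\; 0 \right) . \tag{10}$$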
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Fig. 3. Information flow for orthogonal distance fitting of implicit features. 
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For the sake of a common practice of program source coding for curve and 
surface, we have interchanged, without loss of generality, the second and the 
fourth row of (3). Then, for plane curve fitting, we consider only the first
two rows of the modified equation (3). Now, equation (9) can be solved for
$\partial x / \partial a$ at $x = x_i'$, and consequently, the Jacobian matrix
(8) and the linear system (7) can be completed. The Jacobian matrix in (7) will
be decomposed by SVD [?]:

$$P J = U W V^T \quad\text{with}\quad U^T U = V^T V = I , \quad W = [\operatorname{diag}(w_1, \ldots, w_q)] ,$$

then the linear system (7) can be solved for $\Delta a$. After a successful
termination of iteration (7), along with the performance index $\sigma_0^2$ of
(5), the Jacobian matrix $J$ provides useful information about the quality of
the parameter estimates as follows:

Covariance matrix: $\operatorname{Cov}(a) = (J^T P^T P J)^{-1}$ ;

Parameter covariance: $\operatorname{Cov}(a_j, a_k) = \sum_{i=1}^{q} \dfrac{V_{ji} V_{ki}}{w_i^2}$ ;
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Variance of parameters : 




Correlation coefficients: $\rho(a_j, a_k) = \dfrac{\operatorname{Cov}(a_j, a_k)}{\sqrt{\operatorname{Cov}(a_j, a_j) \operatorname{Cov}(a_k, a_k)}}$ .



Using the above information, we can test the reliability of the estimated
parameters $a$ and the propriety of the model selection (object
classification). For example, suppose we try to fit an ellipsoid to a set of
measurement points of a sphere-like object surface. Then, besides
$a \approx b \approx c$ and the existence of strong correlations between the
rotation parameters and the other parameters, we get very large variances of
the rotation parameters. In this case, the rotation parameters are redundant
(over-parameterization), and, although the fitting of an ellipsoid to the
points set has a better performance than a sphere fitting according to (5), the
proper model feature for the points set is a sphere.

We would like to stress that only the standard model feature equation (1),
without involvement of the position/rotation parameters, is required in (10).
The overall structure of our algorithm remains unchanged for all dimensional
fitting problems of implicit features. All that is necessary for a new implicit
feature is to derive the FHG matrix of (10) from (1) of the new model feature,
and to supply a proper set of initial parameter values $a_0$ for iteration (7).
This fact makes possible the realization of a universal and very efficient
orthogonal distance fitting algorithm for implicit surfaces and plane curves.
An overall schematic information flow is shown in Fig. 3.
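The nested scheme can be illustrated on the simplest model feature, a circle. The following is a simplified sketch and not the algorithm of this paper: for a circle the orthogonal contacting point is available in closed form, so the inner iteration collapses to one line, and the outer Gauss-Newton step is written for the signed orthogonal distances with unit weights instead of the coordinate error vectors. The initial values (centroid and RMS central distance) follow the suggestion made for sphere fitting in the summary; all function names are hypothetical.

```python
import math

def solve(A, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_circle(points, tol=1e-12, max_iter=100):
    """Gauss-Newton fit of (xc, yc, r) minimizing the sum of squared orthogonal distances."""
    n = len(points)
    xc = sum(p[0] for p in points) / n                # centroid as initial center
    yc = sum(p[1] for p in points) / n
    r = math.sqrt(sum((px - xc) ** 2 + (py - yc) ** 2 for px, py in points) / n)
    for _ in range(max_iter):
        JtJ = [[0.0] * 3 for _ in range(3)]
        Jte = [0.0] * 3
        for px, py in points:
            dx, dy = px - xc, py - yc
            rho = math.hypot(dx, dy)
            e = rho - r                               # signed orthogonal distance
            row = (-dx / rho, -dy / rho, -1.0)        # derivative of e w.r.t. (xc, yc, r)
            for i in range(3):
                Jte[i] += row[i] * e
                for j in range(3):
                    JtJ[i][j] += row[i] * row[j]
        step = solve(JtJ, [-v for v in Jte])          # normal equations of the linearized problem
        xc, yc, r = xc + step[0], yc + step[1], r + step[2]
        if max(abs(s) for s in step) < tol:
            break
    return xc, yc, r

angles = [0.3, 1.1, 2.0, 2.9, 4.0, 5.2]
data = [(1.0 + 3.0 * math.cos(t), 2.0 + 3.0 * math.sin(t)) for t in angles]
fit = fit_circle(data)
print(fit)
```

The sketch solves the normal equations directly for brevity; the text above uses an SVD of the weighted Jacobian instead, which also yields the covariance information.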

In order to demonstrate the capabilities of our algorithm, we give a demanding
fitting example, a superquadric fitting. A superquadric [?] is a generalization
of an ellipsoid, and is described in implicit form as below:

$$f(a, b, c, \varepsilon_1, \varepsilon_2, x) = \left( (x/a)^{2/\varepsilon_1} + (y/b)^{2/\varepsilon_1} \right)^{\varepsilon_1/\varepsilon_2} + (z/c)^{2/\varepsilon_2} - 1 = 0 ,$$

where $a$, $b$, $c$ are the axis lengths, and the exponents $\varepsilon_1$,
$\varepsilon_2$ are the shape coefficients. In comparison with the algebraic
fitting algorithm [?], our algorithm can also fit extreme shapes of a
superquadric (e.g. a box with $\varepsilon_{1,2} \ll 1$, or a star with
$\varepsilon_{1,2} > 2$). We obtain the initial parameter values set from a
sphere fitting and an ellipsoid fitting, successively, with
$\varepsilon_1 = \varepsilon_2 = 1$ (Table 2). Superquadric fitting to the 30
points in Table 1 representing a box is terminated after 18 outer iteration
cycles for $\|\Delta a\| = 7.9 \times 10^{-\ldots}$ (Fig. 4).

4 Summary

Dimensional model fitting finds its applications in various fields of science
and engineering, and is a relevant subject in computer vision and coordinate
metrology. In this paper, we have presented a new algorithm for orthogonal
distance fitting of implicit surfaces and plane curves, by which the estimation
parameters are grouped into form/position/rotation parameters and
simultaneously estimated. The new algorithm is universal and very efficient
from the viewpoint
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Table 1. Thirty coordinate triples representing a box 



X     -4     1     4    20   -11   -26    -3    -7     6    11
Y      3    16   -11   -17     1     7   -13   -26    19    24
Z     13    29    10    22     4    -8     1   -15     9    19

X      3    15    18   -21    -4    -2   -14    20     4     6
Y    -18     9    -3     3   -14    19    14   -17   -20    20
Z    -25    21    22   -13   -22    11     1    15   -18    24

X     22    30    -8   -16     8    26   -22    -2    -3     7
Y     -8    -9    15    15   -13   -14    12   -22    -3     1
Z      4    12    -9   -18   -15    17   -13   -20     9    -5



Table 2. Results of the orthogonal distance fitting to the points set in Table 1

Parameters a    σ0         a          b          c          ε1         ε2
Sphere          33.8999    46.5199    -          -          -          -
Ellipsoid       14.4338    46.3303    25.5975    9.3304     -          -
Superquadric    0.9033     24.6719    20.4927    8.2460     0.0946     0.0374
σ(a)            -          0.1034     0.1026     0.0598     0.0151     0.0197

Parameters a    Xo         Yo         Zo         ω          φ          κ
Sphere          27.3955    18.2708    -20.8346   -          -          -
Ellipsoid       1.6769     -1.2537    0.8719     0.7016     -0.7099    0.6925
Superquadric    1.9096     -1.0234    2.0191     0.6962     -0.6952    0.6960
σ(a)            0.0750     0.0690     0.0774     0.0046     0.0031     0.0059



of implementation and application to a new model feature. Memory space and
computing time costs are proportional to the number of the given points. Our
algorithm converges very well and does not necessarily require a good set of
initial parameter values, which could also be provided internally (e.g.
gravitational center and RMS central distance of the given points set for
sphere fitting, sphere parameters for ellipsoid fitting, and ellipsoid
parameters for superquadric fitting, etc.). If there is a danger of local
minimum estimation, we apply the random walking technique along with the line
search [?]. For practical applications, we can individually weight each
coordinate of the given data points with the reciprocals of the axis accuracy
of the measuring machine (see Eq. (5)). Together with another algorithm for
orthogonal distance fitting of parametric features [?], our algorithm is
certified by the German authority PTB, with a certification grade that the
parameter estimation accuracy is better than 0.1 μm for length units and
0.1 μrad for angle units for all parameters of all tested model features.



Least Squares Orthogonal Distance Fitting of Implicit Curves and Surfaces 405 




Fig. 4. Orthogonal distance fit to the points set in Table 1. (a) Superquadric
fit; (b) Convergence, plotted against the iteration number. Iteration numbers
0-42: ellipsoid fit; 43 onward: superquadric fit ($a > b > c$).
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Abstract. In this paper, we present a new approach for coarse segmen- 
tation of tubular anatomical structures in 3D image data. Our approach 
can be used to initialise complex deformable models and is based on an 
extension of the randomized Hough transform (RHT), a robust method 
for low-dimensional parametric object detection. In combination with 
a discrete Kalman filter, the object is tracked through 3D space. Our 
extensions to the RHT feature adaptive selection of the sample size, 
expectation-dependent weighting of the input data, and a novel 3D 
parameterisation for straight elliptical cylinders. For initialisation, only 
little user interaction is necessary. Experimental results obtained for 3D 
synthetic as well as for 3D medical images demonstrate the robustness of 
our approach w.r.t. image noise. We present the successful segmentation
of tubular anatomical structures such as the aortic arch or the spinal
cord.

Keywords: 3D medical images, 3D tubular structure segmentation, 
minimal user interaction, randomized Hough transform (RHT), Kalman 
filter-based tracking 



1 Introduction 

Deformable models are often used to segment objects in complex 2D and 3D
images (e.g. [?]). Since local optimisation methods are usually employed for
deformable model fitting, the model initialisation is generally required to be
close to the real object to obtain reasonable fitting results. In particular,
initialisation becomes a major problem for elongated objects with complicated
shapes (such as long blood vessels), which are generally described by a large
set of parameters. Typically, these model parameters have to be initialised
manually. Thus, there is a clear need for automated methods that yield an
approximate segmentation of tubular objects with complex shapes while
requiring minimal user interaction.

Only a few approaches (e.g. [?]) consider the segmentation of tubular
structures in 3D medical image data with minimal user interaction. However,
their drawbacks are that either an ad-hoc slice-tracking procedure is used in
conjunction with threshold-based determination of the tube wall [?], or the
method works only for a previously given scale of the tube diameter [?].

In this contribution, we introduce a new approach for the segmentation of
tubular structures in 3D image data. To detect objects in 3D image data, we
extend in Sect. 2 the randomized Hough transform (RHT), a robust method for
low-dimensional parametric object detection, while in Sect. 3 we describe how
this method can be combined with a discrete Kalman filter to track the objects
through 3D space. Segmentation is achieved by detecting elliptical cross
sections or straight elliptical cylinder segments that are subsequently tracked
through the 3D image. Our algorithm has been applied to both 3D synthetic and
3D medical images (Sect. 4).

2 Extensions of the Randomized Hough Transform

The Hough transform is a well-known method for parametric object detection,
where detected pixels in the input image are mapped to a discrete parameter
space, whose maxima represent object candidates. In order to detect ellipses or
straight elliptical cylinders, which have a relatively large number of
parameters (leading to excessive space/time complexity for the conventional
Hough transform), we build upon the so-called randomized Hough transform (RHT)
[?]. The RHT uses a randomly chosen subset of the input data, together with a
so-called many-to-one sampling scheme that maps input point sets to
zero-dimensional point sets in the parameter space. Thus, parameter spaces of
higher dimensions remain tractable if a dynamic accumulation scheme is used.
Several drawbacks inherent to coarse-to-fine and parameter-space decomposition
Hough transforms, such as increased noise sensitivity and projection artifacts
[?], can be alleviated by the RHT, because the parameter space can always have
both full dimension and resolution. Our proposed extensions include a new
object parameterisation for straight elliptical cylinders, new derivations for
an adaptive sample size, and a novel input weighting scheme.
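The many-to-one sampling and the dynamic accumulator can be illustrated with the simplest parametric object, a straight line in 2D, where two sampled points map to exactly one cell of the parameter space. This is only a sketch of the general RHT idea, not of the ellipse or cylinder detection of this paper; the cell size, the test data, and all function names are assumptions made for the example.

```python
import math
import random
from collections import Counter

def line_cell(p, q, step=0.01):
    """Map two points to the quantised (theta, rho) cell of the line through them."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length       # unit normal of the line
    rho = nx * p[0] + ny * p[1]              # signed distance of the line from the origin
    if rho < 0:                              # make the representation unique
        nx, ny, rho = -nx, -ny, -rho
    theta = math.atan2(ny, nx)
    return (round(theta / step), round(rho / step))

def rht_lines(points, n_samples, rng):
    """Randomized Hough transform: vote with random point pairs, return the best cell."""
    accumulator = Counter()                  # dynamic accumulator, only visited cells exist
    for _ in range(n_samples):
        p, q = rng.sample(points, 2)         # many-to-one: two points give one cell
        accumulator[line_cell(p, q)] += 1
    return accumulator.most_common(1)[0]

line_points = [(float(x), 2.0 * x + 1.0) for x in range(10)]   # points on y = 2x + 1
clutter = [(3.0, -4.0), (7.5, 2.0), (-2.0, 9.0)]               # points not on the line
cell, votes = rht_lines(line_points + clutter, 200, random.Random(0))
print(cell, votes)
```

Since only cells that actually receive votes are stored, the memory cost depends on the number of samples rather than on the size of the full parameter space, which is what keeps higher-dimensional parameterisations tractable.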



2.1 Object Parameterisations 

For the object parameterisation of an ellipse, we apply an approach proposed
for the Hough transform in [?, p. 151], which determines a unique ellipse (if
there is any) that passes through a set of five coplanar points. The advantages
over other methods are that no additional information (such as local edge
direction) is needed, and, because the problem is reduced to linear equations,
standard numerical methods can be applied. As discussed in Sect. 3, the plane
from which the samples are taken should be orthogonal to the tube axis. The
straight elliptical cylinder is a degenerate quadric, so that a direct approach
in the above sense leads to a rather complex non-linear algebraic problem. For
the sake of speed, we therefore devised the following two-step algorithm for
straight elliptical cylinder parameter calculation, using the fact that it is a
ruled surface, but still avoiding the use of local edge direction (which is
hard to estimate in a robust way with the required accuracy):

Fig. 1: Determining the cylinder axis

1. Randomly select five coplanar points and determine the ellipse E through
them (if there is any).

2. Randomly select two distinct points p and q outside the plane containing E.
Sweep a line through p and points along ellipse E, and calculate the ellipse
E' in the same plane generated by simultaneously sweeping a parallel line
through q (see Fig. 1). The intersection points of these two ellipses (it can
be shown that there are at most two) determine the possible cylinder axes: a
line through one of those intersections and q (solid line in Fig. 1) is
parallel to the cylinder axis, which of course intersects the two ellipses'
plane at the center of E. Having determined the axis, the remaining cylinder
parameters can easily be calculated.
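The first step, determining the ellipse through five coplanar points by linear equations, can be sketched as follows. The normalisation used here, $A x^2 + B x y + C y^2 + D x + E y = 1$, is an assumption of this example and fails for conics passing through the origin of the in-plane coordinate system; the cited approach may use a different normalisation. The function names are hypothetical.

```python
import math

def solve_linear(A, b):
    """Gaussian elimination with partial pivoting; raises if the system is singular."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("the points do not determine a unique conic")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= k * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ellipse_through(points):
    """Coefficients (A, B, C, D, E) of the conic through five points, or None if it is no ellipse."""
    rows = [[x * x, x * y, y * y, x, y] for x, y in points]
    A, B, C, D, E = solve_linear(rows, [1.0] * 5)
    if B * B - 4.0 * A * C >= 0.0:           # discriminant test: parabola or hyperbola
        return None
    return A, B, C, D, E

# Five points on the ellipse x^2 / 4 + y^2 = 1
pts = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0), (math.sqrt(2.0), math.sqrt(0.5))]
coeffs = ellipse_through(pts)
print(coeffs)
```

In an RHT setting, a sample of five points for which the function returns `None` (or raises) would simply be discarded and a new sample drawn.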



2.2 Adaptive Sample Size 



When using random sampling schemes for the Hough transform, it is normally no
longer necessary to consider all points of the input data set. Usually, a
subset will suffice, whose size depends on the input data quality. This subset
is typically generated by drawing a sample of a given size from the input data
points. Instead of the formulas given in [?], which are only hints for sample
size selection, we suggest a different way to derive an estimate of an
appropriate (data-dependent) sample size. This enables us to state an upper
bound for the probability of false detections. Although the derivation is
strictly correct only for an unlimited sample size, in practice the given error
limit is almost always valid.

Because the counts in the accumulator cells are binomially distributed [?], we
can approximate an individual cell's count for large sample sizes by the normal
distribution. Thus, if we know a priori the probability $p_s$ for sampling
points from a significant object and the probability $p_{ns}$ for sampling
points from a non-significant object, it can be shown that if the sample size
satisfies

$$n > z^2 \, \frac{\left( \sqrt{p_s (1 - p_s)} + \sqrt{p_{ns} (1 - p_{ns})} \right)^2}{(p_s - p_{ns})^2} , \tag{1}$$

then with probability of at least $1 - \operatorname{erf}(z/2)$ the counts of
significant cells are larger than those of non-significant ones. This actually
prevents non-significant objects from being erroneously detected, because for
the RHT, the detected object coincides with the maximal accumulator count. The
value $z$ thus parameterises the sample size and the remaining error
probability.
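As a rough numerical illustration of how such a bound behaves, the sketch below evaluates a sample size bound of the form $n > z^2 (\sqrt{p_s(1-p_s)} + \sqrt{p_{ns}(1-p_{ns})})^2 / (p_s - p_{ns})^2$. This form is an assumption: it follows from requiring the expected counts of the two cells to be separated by $z$ standard deviations each under the normal approximation, and it may differ in detail from the authors' bound. The function name and the example probabilities are hypothetical.

```python
import math

def sample_size_bound(p_s, p_ns, z):
    """Lower bound on the sample size n under the assumed form of the bound."""
    if p_s <= p_ns:
        raise ValueError("the significant object must be sampled more often")
    spread = math.sqrt(p_s * (1.0 - p_s)) + math.sqrt(p_ns * (1.0 - p_ns))
    return (z * spread / (p_s - p_ns)) ** 2

bound = sample_size_bound(0.5, 0.1, 3.0)
weak = sample_size_bound(0.2, 0.1, 3.0)     # less distinct objects need more samples
print(bound, weak)
```

The bound grows quickly when the two probabilities approach each other, which is why a data-dependent refinement of $p_s$ is useful.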

We can refine this a priori sample size estimate by calculating the probability
$p_s'$, which denotes the actual probability for sampling points from an object
corresponding to a given cell, using the theorem of the iterated logarithm by
Khintchine (e.g. [?, pp. 204]):

$$p_s' > \frac{nk + n \log(\log(n)) - \sqrt{n \log(\log(n)) \left( 2nk - 2k^2 + n \log(\log(n)) \right)}}{n^2 + 2n \log(\log(n))} , \tag{2}$$

where $n$ denotes the current sample size and $k$ the current cell count. As it
is only required for the most significant object's cell count to be larger than
those of non-significant cells, we can substitute for $p_s$ in (1) the estimate
$p_s'$ calculated from the maximum accumulator cell count to determine a
data-dependent sample size.
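The lower bound on the sampling probability can be evaluated directly from the current sample size and cell count. The sketch below implements the expression $(nk + nL - \sqrt{nL(2nk - 2k^2 + nL)}) / (n^2 + 2nL)$ with $L = \log(\log(n))$, which is the smaller root of $(k - np)^2 = 2np(1-p)L$ and appears to be how the bound follows from the law of the iterated logarithm. The function name and the numbers are illustrative assumptions.

```python
import math

def probability_lower_bound(n, k):
    """Lower bound on the sampling probability of a cell with count k after n samples."""
    if n < 3:
        raise ValueError("log(log(n)) must be positive, so n must be at least 3")
    L = math.log(math.log(n))
    root = math.sqrt(n * L * (2.0 * n * k - 2.0 * k * k + n * L))
    return (n * k + n * L - root) / (n * n + 2.0 * n * L)

n, k = 1000, 300
p_low = probability_lower_bound(n, k)
print(p_low, k / n)
```

As expected, the bound lies below the raw relative frequency `k / n`, and the gap shrinks as the sample size grows.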



2.3 Windowing the Hough Transform

In images from cluttered environments, non-relevant image structures often
distract the RHT from the relevant features, especially when the objects to be
detected are tiny compared to the image size. Therefore, masking out all but
small areas around the features has proven to be advantageous [?], where
usually rectangular, binary-valued windows are used. We extend masking towards
arbitrary discrete functions that are adapted to the problem at hand: as the
input pixels are selected randomly for the RHT, we have to induce an
appropriate discrete probability density function onto the input image, with
high values at the most probable object locations. These arbitrary discrete
probability functions can be realised by employing the so-called alias method
described in [?].

As the Kalman filter (explained below) yields estimates of the expected
position, size, and shape of the input object in the next slice, the image can
be windowed appropriately. Assuming a Gaussian distribution of the ellipse's
main axes errors and exact estimates of the center position and the rotation
angle, the window function in Fig. 2 shows the incorporation of Kalman
prediction into the RHT.

Fig. 2: Example of a continuous window function for ellipse detection (plotted
quantity: P(x, y))
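Drawing input pixels according to an arbitrary discrete weighting can be sketched with a standard formulation of the alias method (table construction in linear time, constant time per draw). This is a generic textbook variant, not necessarily the one in the cited reference; the pixel weights and function names are assumptions for illustration.

```python
import random

def build_alias_table(weights):
    """Build the probability and alias tables for sampling indices proportional to weights."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]        # average scaled weight is 1
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]                          # keep s with this probability
        alias[s] = l                                 # otherwise redirect to l
        scaled[l] = scaled[l] + scaled[s] - 1.0      # l has donated part of its mass
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:                          # leftovers equal 1 up to rounding
        prob[i] = 1.0
        alias[i] = i
    return prob, alias

def draw(prob, alias, rng):
    """Draw one index: pick a column uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

pixel_weights = [5.0, 3.0, 2.0]                      # e.g. window function values at three pixels
prob, alias = build_alias_table(pixel_weights)
rng = random.Random(1)
samples = [draw(prob, alias, rng) for _ in range(1000)]
print(prob, alias)
```

In the tracking context, the weights would be the window function values at the edge pixels of the current slice, so that pixels near the predicted ellipse are sampled more often.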



3 Kalman Filter Approach 

In order to track the detected objects (either ellipses or straight elliptical
cylinders) through 3D image data, we apply a discrete Kalman filter [?]. Apart
from the useful prediction information, which can be exploited as mentioned in
Sect. 2.3, this approach allows us to include an explicit model of the tubular
structure's axis curve. For modeling this axis, we used both a linear and a
quadratic Taylor approximation.
If $x_k$ is a point on the tube axis at time $k \Delta t$, an estimate at instant $(k+1) \Delta t$
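can be calculated as follows (quadratic model):

$$x_{k+1} = x_k + \Delta t \, x_k' + \tfrac{1}{2} \Delta t^2 \, x_k'' , \qquad x_{k+1}' = x_k' + \Delta t \, x_k'' , \qquad x_{k+1}'' = x_k'' , \tag{3}$$

where $\Delta t$ denotes the time interval and $x_k'$, $x_k''$ the first and
second derivative w.r.t. time, respectively.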
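The state prediction of the quadratic axis model amounts to a second-order Taylor step applied to each coordinate. The sketch below shows only this deterministic part; the covariance propagation and the measurement update of a full Kalman filter are deliberately omitted, and the function name is hypothetical.

```python
def predict_axis_state(x, v, a, dt):
    """One prediction step: position x, first derivative v, second derivative a."""
    x_next = tuple(xi + dt * vi + 0.5 * dt * dt * ai for xi, vi, ai in zip(x, v, a))
    v_next = tuple(vi + dt * ai for vi, ai in zip(v, a))
    a_next = tuple(a)                        # the second derivative is modelled as constant
    return x_next, v_next, a_next

state = predict_axis_state((0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (2.0, 0.0, 0.0), 1.0)
print(state)
```

Setting the second derivative to zero reduces this to the linear axis model that was also evaluated.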

Therefore, taking together the results from the previous sections, we were able
to implement a tube tracker that adapts itself to the input data quality and
restricts random sampling effectively to the Kalman-predicted area.
Furthermore, the Kalman prediction of the axis curve offers an additional
advantage: if the intersection plane is not orthogonal to the cylinder axis,
even tubular structures with elliptical cross sections generally do not yield
ellipses when intersected by a plane (see e.g. [?] or [?] for a more in-depth
discussion). Thus, when employing the detection of elliptical cross sections,
it is essential to re-slice the 3D input data locally, orthogonal to the Kalman
prediction of the axis curve. Otherwise, when tracking highly curved
structures, the plane intersection curve will differ significantly from an
ellipse after a few steps. This can lead to inaccuracies due to model mismatch.



4 Experimental Results 

We performed about 20,000 experiments using 3D synthetic binary data with
varying object sizes and noise levels to assess the performance and robustness
for all four possible combinations of the described object parameterisations
(ellipse and straight elliptical cylinder) and the axis approximations (linear
and quadratic). It turned out that the detection of elliptical cross sections
and the tracking with the quadratic model in (3) performed best. For this
variant we show in Fig. 3 one of the binary input images used (slices of a
torus), which was degraded with 10% shot noise, and the rendered segmentation
result. The plot on the right hand side of Fig. 3 depicts the mean fraction of
successfully segmented 3D input images (solid line). Successful here means that
the whole arch was tracked and the detected axis was never farther away from
the real axis than twice the radius. The mean of that distance from the true
arch axis is depicted by the dashed line, whereas the vertical error bars for
both plots denote the standard deviation from the corresponding mean. The
results are promising, especially regarding noise insensitivity. Up to a noise
level of 15%, the test structures were always fully tracked. The small
systematic error w.r.t. the true torus axis (≈ 0.5 pixel) is due to
discretisation artifacts in the RHT accumulator. In comparison, the other
variants were more sensitive to noise. However, for small noise levels,
tracking elliptical cylinder segments was significantly more accurate than
tracking elliptical cross-sections.

We also conducted experiments using 3D Magnetic Resonance (MR) data, 
applying the same approach as above (detection of elliptical cross sections and 
tracking with the quadratic model). To generate the binary data sets necessary 
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Fig. 3. Experiments using 3D synthetic data. The two images on the left hand side 
show at the top a slice of a 3D input data set (torus, degraded with 10% shot noise) 
and below the segmentation result. On the right hand side, a plot of two statistics for 
the experiments with the method using detection of elliptical cross sections and the 
quadratic axis model is given, where every point is the mean value of 20 to 40 separate 
experiments. 



for the algorithm, we used a 3D extension of the edge detection method
described in [?]. Figure 4 shows the segmentation result of the human aorta in
a thorax MR angiography after 3D edge detection. The only initial parameters
required for the algorithm were the start point of the aorta axis, the initial
radius, and an approximate initial direction. The algorithm then continued
segmentation through 3D space until the bottom of the image data volume was
reached.

Using the same approach, Fig. 5 shows the result for a 3D MR image of the
human head, where the spinal cord was segmented, starting from the bottom of
the image up to the medulla oblongata.

5 Summary 

We have proposed a new segmentation method for tubular structures in 3D image 
data, which is based on an extension of the randomized Hough Transform and a 
discrete Kalman filter. Experimental results obtained for both 3D synthetic and 
3D MR image data showed promising results, especially regarding the robustness 
w.r.t. noise. Out of the four variants we investigated, the one using a quadratic 
axis model and the detection of elliptical cross sections performed best. Because 
of its robustness and the small number of required initial values, we consider
our novel algorithm to be well suited for coarse segmentation as well as for
the initialisation of complex deformable models.
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(a) Coronal slice of the thorax 



(b) 3D segmentation of the aorta 



Fig. 4. Experiments using real 3D MR-angiography data of the thorax. CPU time 
(without prior 3D edge detection) for segmentation of the aorta was 347.29 sec on a 
300 MHz Sun Ultra 2 workstation. The arrow marks the position of the aorta when 
crossing the image plane. 




(a) Sagittal slice of the human head



(b) 3D segmentation of the spinal cord



Fig. 5. Experiments using real 3D MR data of the human head. CPU time (without
prior 3D edge detection) for segmentation of the spinal cord (see arrow) was
108.46 sec on a 300 MHz Sun Ultra 2 workstation.
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Abstract. The purpose of this paper is to discuss pros and cons of fit- 
ting general curves and surfaces to 2D and 3D edge and range data using 
the Euclidean distance. In the past researchers have used approximate 
distance functions rather than the Euclidean distance. But the main dis- 
advantage of the Euclidean fitting, computational cost, has become less 
important due to rising computing speed. Experiments with the real Eu- 
clidean distance show the limitations of suggested approximations like 
the Algebraic distance or Taubin’s approximation. We compare the per- 
formance of various fitting algorithms in terms of efficiency, correctness, 
robustness and pose invariance. 

1 Introduction 

The ability to construct CAD or other object models from edge and range data
is of fundamental importance in building a recognition and positioning system.
While the problem of model fitting has been successfully addressed, achieving
high accuracy and stability of the fitting is still an open problem. On the one
hand, it is imperative to solve the problem of how curves and surfaces can be
fitted to a given data set. On the other hand, it is obvious that the accuracy
and stability of the fitting have a substantial impact on the recognition
performance, especially in reverse engineering, where we desire an accurate
reconstruction of 3D geometric models of objects from range data. Thus it is
very important to get good shape estimates from the data.

Implicit polynomial curves and surfaces are potentially among the most useful
object or data representations for use in computer vision and image analysis.
Their power lies in their ability to smooth noisy data and to interpolate
through sparse or missing data, in their compactness, and in their form being
commonly used in numerous constructions. An implicit curve or surface is the
zero set of a smooth function $f : \mathbb{R}^n \to \mathbb{R}^m$ of the $n$
variables: $Z(f) = \{x : f(x) = 0\}$. Let $f(x)$ be an implicit polynomial of
degree $d$ given by

$$f(x) = \sum_{0 \le i + j + k \le d} a_{ijk} \, x_1^i x_2^j x_3^k = 0 . \tag{1}$$

Then, we only have to determine the parameter set $\{a_{ijk}\}$ that describes
the given data best.

* The work was funded by the CAMERA (CAD Modelling of Built Environments
from Range Analysis) project, an EC TMR network (ERB FMRX-CT97-0127).
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2 Least Squares Fitting of General Curves and Surfaces 

Parameter estimation, usually cast as an optimization problem, can be divided
into three general techniques: least-squares fitting (e.g. [?]), Kalman
filtering (e.g. [?]), and robust techniques (e.g. [?]). Given a finite set of
data points $\mathcal{D} = \{x_p\}$, $p \in [1, P]$, the problem of fitting a
general curve or surface $Z(f)$ to $\mathcal{D}$ by a least-squares method is
to minimize a distance measure

$$\frac{1}{P} \sum_{p=1}^{P} \operatorname{dist}^2(x_p, Z(f)) \to \text{Minimum} \tag{2}$$

from the data points $x_p$ to the curve or surface $Z(f)$, a function of the
parameter set $\{a_{ijk}\}$. The distance from the point $x_p$ to the zero set
$Z(f)$ is defined as the minimum of the distances from $x_p$ to points
$x_t \in Z(f)$:

$$\operatorname{dist}(x_p, Z(f)) = \min \{ \| x_p - x_t \| : f(x_t) = 0 \} . \tag{3}$$

In the past researchers have often replaced the real Euclidean distance by an
approximation. But it is well known that a different performance function can
produce biased results, and for many primitive curves and surfaces a closed
form expression exists for the Euclidean distance. In the following we
summarize the Algebraic fitting, Taubin's fitting [?] and a Euclidean fitting [?].
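How much the three distance measures can disagree is easy to see on a circle, for which the Euclidean distance has a closed form. In the sketch below the algebraic distance is taken as $|f(x)|$ with $f = x^2 + y^2 - r^2$, and the first order approximation as $|f(x)| / \|\nabla f(x)\|$; this form of Taubin's distance is the commonly used one and is stated here as an assumption. The function name is hypothetical.

```python
import math

def circle_distances(x, y, r=1.0):
    """Algebraic, first-order (Taubin-style) and Euclidean distance of (x, y) to a circle."""
    f = x * x + y * y - r * r
    grad_norm = math.hypot(2.0 * x, 2.0 * y)     # zero only at the circle center
    algebraic = abs(f)
    first_order = abs(f) / grad_norm
    euclidean = abs(math.hypot(x, y) - r)
    return algebraic, first_order, euclidean

near_curve = circle_distances(1.1, 0.0)          # point close to the circle
near_center = circle_distances(0.01, 0.0)        # point close to the critical point of f
print(near_curve)
print(near_center)
```

Close to the curve the first order approximation is good, whereas near the center of the circle, where the gradient almost vanishes, it overestimates the true distance by a large factor.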

Algebraic fitting (AF) approximates the Euclidean distance by the algebraic
distance $\mathrm{dist}_A(x_p, Z(f)) = f(x_p)$. Given the algebraic distance for each
point, Eq. (2) can be formulated as an eigenvector problem. To avoid the trivial
solution $\{a_{ijk}\} = 0$ and any multiple of a solution, the parameter set
$\{a_{ijk}\}$ may be constrained in some way. The advantage of algebraic distances
is the gain in computational efficiency, because closed form solutions can usually
be obtained; the disadvantage is that the results are often unsatisfactory.

Taubin’s fitting (TF) uses the first order approximation of Eq. (3) to estimate

$$ \mathrm{dist}_T(x_p, Z(f)) = \frac{|f(x_p)|}{\|\nabla f(x_p)\|}. \qquad (4) $$

The advantages of Taubin’s distance are that no iterative procedures are
required and that it is a first order approximation to the exact distance; the
disadvantage is that the approximate distance is also biased in some sense. If,
for instance, a data point $x_p$ is close to a critical point of the polynomial,
i.e., $\|\nabla f(x_p)\| \approx 0$ but $f(x_p) \neq 0$, the distance becomes large,
which is certainly a limitation. To minimize Eq. (2), the use of the
Levenberg-Marquardt (LM) algorithm is proposed.
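The difference between the three distance measures can be made concrete with a small sketch. It assumes the simplest possible implicit polynomial, a circle centred at the origin, $f(x, y) = x^2 + y^2 - r^2$, for which the Euclidean distance of Eq. (3) has a closed form; the function names are illustrative and not taken from the paper.

```python
import math


def f(x, y, r=1.0):
    """Implicit polynomial of a circle centred at the origin (illustrative choice)."""
    return x * x + y * y - r * r


def grad_f(x, y):
    """Gradient of f with respect to (x, y)."""
    return (2.0 * x, 2.0 * y)


def algebraic_distance(x, y, r=1.0):
    """Algebraic distance: the absolute value of the polynomial at the point."""
    return abs(f(x, y, r))


def taubin_distance(x, y, r=1.0):
    """First order approximation of Eq. (4): |f| / ||grad f||."""
    gx, gy = grad_f(x, y)
    return abs(f(x, y, r)) / math.hypot(gx, gy)


def euclidean_distance(x, y, r=1.0):
    """Exact distance from the point to the circle (closed form for this primitive)."""
    return abs(math.hypot(x, y) - r)


if __name__ == "__main__":
    # Near the curve all three agree reasonably well; near the centre, a critical
    # point of f, the gradient vanishes and Taubin's distance becomes very large.
    for px in (1.1, 2.0, 0.01):
        print(px,
              algebraic_distance(px, 0.0),
              taubin_distance(px, 0.0),
              euclidean_distance(px, 0.0))
```

The last test point illustrates the limitation mentioned above: close to the centre of the circle the gradient is almost zero while $f$ is not, so the approximated distance is far larger than the true one.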

Euclidean fitting (EF) replaces the approximated distances by the true Euclidean
distance, which is invariant to transformations in Euclidean space
and not biased. For primitive curves and surfaces like straight lines, ellipses,
planes, cylinders, cones, and ellipsoids, a closed form expression exists
for the Euclidean distance from a point to the zero set, and we use
these. However, as the general expression of the Euclidean distance is more
complicated and no closed form expression is known, an iterative
optimization procedure must be carried out. Given the Euclidean distance
$\mathrm{dist}_E(x_p, Z(f))$ for each point, the following simple algorithm can be used:

1. The Euclidean fitting requires an initial estimate for the parameters
$\{a_{ijk}\}$, and we have found that the result of Taubin’s fitting method is
more suitable than others. We get the initial parameter set $\{a_{ijk}\}^{[0]}$.

2. In the second step the parameter set $\{a_{ijk}\}^{[s]}$, $s = 0, 1, \ldots$, is updated
using the LM algorithm, minimizing the sum of Euclidean distances for all data points.

3. Finally, each updated parameter set $\{a_{ijk}\}^{[s+1]}$ is evaluated by an M-estimator
on the basis of $\mathrm{dist}_E(x_p, Z(f))$. If the estimate has improved, $\{a_{ijk}\}^{[s+1]}$
is accepted and the fitting is continued with step 2. Otherwise the fitting is
terminated and $\{a_{ijk}\}^{[s]}$ is the desired solution.
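The three steps above can be sketched for the simplest primitive with a closed form Euclidean distance, a circle in 2D. This is a minimal illustration under several assumptions that are not from the paper: the initial estimate is the centroid and mean radius rather than the result of Taubin’s fitting, the acceptance test uses the plain sum of squared distances rather than a specific M-estimator, and the damping factor is changed by a factor of 10 per iteration, a common LM schedule. All names are hypothetical.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def residuals(params, pts):
    """Signed Euclidean distances from the points to the circle (cx, cy, r)."""
    cx, cy, r = params
    return [math.hypot(x - cx, y - cy) - r for x, y in pts]


def jacobian(params, pts):
    """Derivatives of each residual with respect to (cx, cy, r).

    Assumes no data point coincides with the current centre estimate.
    """
    cx, cy, _ = params
    rows = []
    for x, y in pts:
        d = math.hypot(x - cx, y - cy)
        rows.append([-(x - cx) / d, -(y - cy) / d, -1.0])
    return rows


def fit_circle_lm(pts, max_iter=100, tol=1e-12):
    # Step 1: initial estimate (assumption: centroid and mean radius).
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    r = sum(math.hypot(x - cx, y - cy) for x, y in pts) / len(pts)
    params = [cx, cy, r]
    err = sum(e * e for e in residuals(params, pts))
    lam = 1e-3
    for _ in range(max_iter):
        # Step 2: LM update, solving (J^T J + lam I) delta = -J^T e.
        e = residuals(params, pts)
        J = jacobian(params, pts)
        A = [[sum(row[i] * row[j] for row in J) + (lam if i == j else 0.0)
              for j in range(3)] for i in range(3)]
        g = [-sum(row[i] * ei for row, ei in zip(J, e)) for i in range(3)]
        delta = solve3(A, g)
        trial = [p + d for p, d in zip(params, delta)]
        trial_err = sum(v * v for v in residuals(trial, pts))
        # Step 3: accept the update only if the error decreased.
        if trial_err < err:
            improvement = err - trial_err
            params, err, lam = trial, trial_err, lam / 10.0
            if improvement < tol:
                break
        else:
            lam *= 10.0
            if lam > 1e12:
                break
    return params, err


if __name__ == "__main__":
    # Partially visible circle: centre (2, -1), radius 3, about 3/4 of the arc.
    data = [(2.0 + 3.0 * math.cos(0.2 * i), -1.0 + 3.0 * math.sin(0.2 * i))
            for i in range(24)]
    print(fit_circle_lm(data))
```

For other primitives only `residuals` and `jacobian` would change; the accept or reject logic of the loop stays the same.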



3 Evaluating the Euclidean Fitting 

To work out the pros and cons of Euclidean fitting we compare the performance
of the EF method with that of AF and TF in terms of
efficiency, correctness and robustness for both simulated and real data. For the
simulated data we generated data sets which describe (elliptical) cylinders
and cones. The 3D data were generated by adding isotropic Gaussian noise
σ = {1%, 5%, 10%, 20%}. Additionally the surfaces were partially occluded: the
visible surface was varied between 1/2 (maximal case) and 1/6 of the full 3D
cylinder. To show that EF also works for real data we used several
range data sets. In all experiments the same constraints describing the expected
surface type were included in all three fitting methods to enforce the fitting
of that surface type. Finally, we examine the pose invariance of the fitting
methods.



3.1 Efficiency 

A good fitting algorithm has to be as efficient as possible in terms of run time and
formal complexity. Although computational cost is no longer a really
hard problem because of rapidly increasing machine speed, the fitting should
run at acceptable computational cost and the algorithm should have
relatively low complexity. All algorithms were implemented in C and
the computations were performed on a Pentium III 466 MHz. The average computational
costs for AF, TF and EF are given in Tab. 1. As expected, AF and
TF show the best performance. The EF algorithm requires a repeated search
for the point $x_t$ closest to $x_p$ and the calculation of the Euclidean distance. A
quick review of the values in Tab. 1 shows that the computational costs increase
when fitting an elliptical cylinder, a circular or elliptical cone, or a general
quadric, because the distance estimation is more complicated.

Table 1. Average computational costs in milliseconds per 1000 points.

                            AF         TF          EF
    Plane                  0.958      1.042       2.417
    Sphere                 1.208      1.250       3.208
    Circular cylinder      3.583      3.625      12.375
    Elliptical cylinder   13.292     13.958     241.667
    Circular cone         15.667     15.833     288.375
    Elliptical cone       15.042     15.375     291.958
    General quadric       18.208     18.458     351.083

In summary, the efficiency is a con of EF, but its cost is bounded by a factor of
about 20 times the cost of the other algorithms and is still computationally reasonable for
up to 10® data points if real-time performance is not needed. 

3.2 Correctness 

The fitting result should describe the data set by the correct
curve or surface type, i.e., it should not fit a false type to the
data. To verify the correctness we tested whether the fitting result of the (constrained)
eigenvalue analysis corresponds to the general curve or surface invariants. If a
solution satisfies the conditions for one curve or surface type, the fitting is
assumed to be correct in the sense of an interpretable real curve or surface. Otherwise,
the fitting is counted as a failure. In our experiments AF sometimes failed to meet
our expectations (in up to 23 percent of the cases), especially with higher noise levels or
a sparse data set (see Sec. 3.3). For TF and EF we had no failures in our
experiments. In summary, correctness is a pro of EF.



3.3 Robustness 

A fitting method must degrade gracefully with increasing noise in the data, with
a decrease in the available relevant data, and with an increase in the irrelevant
data. To evaluate the robustness of the proposed EF, we use synthetically generated
data describing an elliptical cylinder, adding isotropic Gaussian noise σ =
{1%, 5%, 10%, 20%} and partial occlusion varied between 1/2 (maximal case)
and 1/6 of the full 3D cylinder. In the first experiment the number of 3D points
for the simulated cylinder was n = 100, and to measure the average fitting error
each experiment was run 100 times. The reported error is the Euclidean geometric
distance between the 3D data points and the estimated surfaces. The mean
square errors (MSEs) and standard deviations of the different fittings are shown in
Fig. 1. As expected, TF and EF yield the best results with respect to the mean
and standard deviation, and the mean for EF is always lower than for the other
two algorithms. The results of AF are only partially acceptable, because of the
mean and the standard deviation. In the direct comparison of TF with EF the
results of EF are much better. As mentioned in Sec. 3.2, AF can sometimes give
wrong results, which means that the fitted curve or surface type does not
meet our expectations. We removed all failed fittings from consideration.

a. Algebraic fitting. b. Taubin’s fitting. c. Euclidean fitting.

Fig. 1. Average least squares error fitting a synthetically generated cylinder with added
Gaussian noise σ = {1%, 5%, 10%, 20%}. The visible surface was varied between 1/2
(maximal case) and 1/6 of the full 3D cylinder. The number of trials was 500.

In the second experiment, the number of 3D points was decreased stepwise
from n = 1000 down to n = 10 to evaluate the behaviour of
the fitting methods. Each experiment was run 100 times. The mean square
errors (MSEs) and standard deviations of the different fittings are shown in Fig. 2.
As expected, TF and EF also yield the best results in this experiment. With
decreasing point density AF in particular becomes more and more unstable,
which is reflected in the mean and standard deviation. Unexpectedly, EF is
very stable even with only n = 10 3D data points. This underlines once more
the outstanding performance of EF. In summary, robustness is clearly a
pro of EF.

a. Algebraic fitting. b. Taubin’s fitting. c. Euclidean fitting.

Fig. 2. Average least squares error fitting a synthetically generated cylinder with added
Gaussian noise σ = {1%, 5%, 10%, 20%}. The number of 3D points was decreased
stepwise from 1000 down to 10. The visible surface was 5/12 of the full 3D cylinder. The
number of trials was 500.

3.4 Pose Invariance

The fitting results should be pose invariant. However, it is well known
that this reasonable and necessary requirement cannot always be guaranteed by
all three fitting methods. To evaluate the pose invariance we use a real data
set (see Fig. 3) describing an elliptical cylinder. The normalized data set was
a) shifted, b) rotated, and c) both rotated and shifted. A quick review of the
residuals (MSE) in Tab. 2 shows that AF and TF are not pose invariant while
EF is pose invariant. To illustrate the pose dependency, the fitting results for
position 3 are visualized in Fig. 3. In summary, pose invariance is clearly a
pro of EF.



Table 2. Residuals of fitting an elliptical cylinder (see Fig. 3). The normalized cylinder
was shifted by t = [0.3, 0.2, 0.1] (pos. 1), rotated by an angle of π/12 about n = [0.5, 1.0, 0.5]
(pos. 2), and both shifted and rotated (pos. 3).

                    normal pos.   position 1   position 2   position 3
    AF                0.5242        2.0181       2.6950       1.8271
    TF   [10^-?]      0.5024        1.5143       2.0277       1.3817
    EF   [10^-?]      0.4021        0.4152       0.8634       0.6088

4 Conclusion 

The focus was on the pros and cons of Euclidean fitting compared with the
commonly used algebraic fitting and Taubin’s fitting. Referring to our objective,
we can finally conclude that we have more pros than cons for Euclidean
fitting. While the main disadvantage of Euclidean fitting, its computational
cost, has become less important due to rising computing speed, robustness and
accuracy increase considerably compared to both other methods. Additionally,
Euclidean fitting is pose invariant.

a. Normalized data. b. Algebraic fitting. c. Taubin’s fitting. d. Euclidean fitting.

Fig. 3. Fitting results for real range data (≈ 3300 points). The normalized data set
was shifted by t = [0.3, 0.2, 0.1] and rotated by an angle of π/12 about n = [0.5, 1.0, 0.5].



Abstract. We describe an Augmented Reality system using the corners of a color
cube for camera calibration. In the augmented image the cube is replaced by a 
computer generated virtual object. The cube is localized in an image by the CSC 
color segmentation algorithm. The camera projection matrix is estimated with a 
linear method that is followed by a nonlinear refinement step. Because of possible 
misclassifications of the segmented color regions and the minimum number of
point correspondences used for calibration, the estimated pose of the cube may 
be very erroneous for some frames; therefore we perform outlier detection and 
treatment for rendering the virtual object in an acceptable manner. 

Keywords: Augmented Reality, camera calibration, color segmentation 



1 Introduction 

In this paper a system is introduced which is an application of Augmented Reality, a
visual enhancement of real environments. Unlike many other applications in this field,
the system described here uses no independent tracking device, but follows a so-called
vision-based approach, i.e. calibration information is derived solely from camera input.
Many common vision-based systems use some kind of calibration pattern or
fiducials that remain visible in the scene even after augmentation. Other methods
require manually selected control points, but do not use fiducials. In our approach,
a metal cube with a side length of 6 cm is used that is painted with a different color on
each side such that its position and orientation can be determined unambiguously. A real
scene can be augmented by rendering a virtual object in the same pose, thus replacing
the cube. A possible application of this system is the use of a head-mounted display
for the visualization of three-dimensional objects that do not yet exist or exist only at distant
locations. The system works without human intervention, and no explicit calibration step
is required before using it. The main advantages of our approach are its usability
in indoor and outdoor applications, easy interaction with the user, and the possibility of
illumination estimation by exploiting the knowledge about the colors used on the cube.

* This work was partially funded by the German Science Foundation (DFG) under grant SFB
603/TP C2. Only the authors are responsible for the content.

Fig. 1. Overview of system structure with special focus on detection of cube corners.



2 System Overview 



The first - and most time-consuming - task of the system is to detect the cube in the image
and to determine its corners, thus establishing a correspondence between 2-D image and
3-D world points. In order to get a correct calibration of the camera, the identification
of the corners must be as accurate as possible. Details on the method described here can
be found in [15].

The system is basically structured as shown in Fig. 1. After an initialization phase
and the training of the color classifier, the main loop of the cube tracker is started.
This is the manager of the whole system: it contains general information, like the image 
size and previously known positions of the cube, and it executes the subsequent phases. 

In each loop the tracker reads the current image and preprocesses it as described in
Sect. 3. Color segmentation is done in two steps: first a version of the image that was
reduced in size is segmented and checked for cube candidates. If one or more candidates
are found, the corresponding image regions are segmented once more, this time in normal
resolution. The color segmentation yields a number of uniformly colored regions. Cube
candidates are identified out of these regions by applying a number of restrictions on sets
of three or two regions. That is, their colors have to correspond to those of the cube and
they must fulfill some geometric properties. If none of the region sets meets these conditions,
the tracker discards this image and proceeds with the next one. Otherwise, a set of one or
more cube candidates remains, each of which is given a score depending on how well it
fits the conditions above. The corner estimator then tries to find the corners of the cubes
with the highest scores. If this step fails, the candidate is rejected and the next one is
tried. If no candidate remains, the tracker continues with the next image as well.

With the corner points identified, it is now possible to do camera calibration. Details
on that part are given in Sect. 4. The final step in the cycle is to render the virtual object
that is to replace the cube in the image. From this point on the sequence starts again with
the acquisition of the next image frame. If a cube was recognized, its position is taken
as a cue for where to start the search in the next image.



3 Finding the Cube 

For locating the corners of the cube, two of its properties are used: its color and its
geometry. In order to distinguish the cube from the background it was painted in saturated,
matte colors. Each side received a unique color such that the system can identify the
cube’s correct orientation. Of course, in a natural scene it is very likely that other objects
are colored similarly. Therefore the geometric appearance of the objects in consideration
has to fit that of a cube as well. The input of the cube detector is a stream of RGB images.
An enhancement of quality is reached by applying a Symmetric Nearest Neighbour
(SNN) filter, which reduces noise like a mean filter, but preserves
edges. Each side of the cube is a region of almost uniform color, except for shadows
cast onto it. By choosing matte colors the problem of highlights reflected from a light
source was mostly eliminated. Therefore the first step in identifying the cube is a color
segmentation of the image. The algorithm used for this is the Color Structure Code
(CSC), which is based on an earlier method.

It is assumed that each visible side of the cube can be segmented as exactly one
region, given that a coarse enough parameterization for the CSC is used. A division of
a side happens only in extreme cases, e.g. if a shadow is partially cast over it. Thus
each region in the image is classified as being a possible cube side or not, according
to its mean color. The numerical classifier applied here uses two feature vectors, the
RGB values of a region’s color and its hue and saturation values, taken from HSV
color space. The distance between a region’s mean color and a cube side color class is
calculated by the Mahalanobis distance measure. As the color segmentation is costly,
the image is first shrunk to a fifth of its size and possible cubes are localized therein. The
corresponding windows are cut out of the full-size image, where the cube detection is
performed again. After color classification a set of regions remains containing all the
regions that possibly are part of the cube. From these the ones truly belonging to the
cube have to be filtered out. The subsequent processing of the image requires that at
least two sides of the cube are visible - a criterion which reduces the number of cube
side candidates, because the knowledge of adjacent colors on the cube can be exploited.
In addition, the regions must have the correct appearance: their edges have to be straight
lines, forming a parallelogram. Each of the remaining regions gets a score measuring
the region’s affinity to the cube. Corners are determined for the regions with the highest
scores. This is done by approximating the edges of each side with straight lines using
the Hough transform. The intersections of these lines are taken as the corners of the side.
In cases where lines are missing due to occlusion, the corners are approximated using
information from neighboring sides or the original image in Cartesian coordinates.
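The color classification step described above can be illustrated with a small sketch of a Mahalanobis distance classifier. It is restricted to a two-dimensional feature vector to keep the matrix inverse explicit; the class names, means, covariances and the rejection threshold are invented for illustration and are not the trained values of the system.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D feature x from a class (mean, 2x2 covariance)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c  # must be non-zero for a valid covariance matrix
    # Quadratic form with the inverse covariance (1/det) * [[d, -b], [-c, a]].
    q = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(q)


def classify(x, classes, max_distance):
    """Return the closest class name, or None if no class is close enough."""
    best_name, best_dist = None, float("inf")
    for name, (mean, cov) in classes.items():
        dist = mahalanobis_2d(x, mean, cov)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


if __name__ == "__main__":
    # Hypothetical cube side classes in a 2-D feature space.
    cube_classes = {
        "red": ((0.9, 0.8), ((0.01, 0.0), (0.0, 0.02))),
        "green": ((0.3, 0.7), ((0.02, 0.0), (0.0, 0.02))),
    }
    print(classify((0.85, 0.75), cube_classes, max_distance=3.0))
    print(classify((0.6, 0.1), cube_classes, max_distance=3.0))
```

Regions rejected by every class would simply be dropped from the set of cube side candidates.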

The algorithms for camera calibration applied in the following sections need at least
six or seven point correspondences, depending on the method. With one side of the cube
visible only four points can be obtained, so that two or three sides have to be visible,
providing six or seven point pairs respectively. In the case where only one side is visible,
the method described in Sect. 4.4 can be applied.




J. Schmidt, I. Scholz, and H. Niemann 



4 Camera Calibration 

Camera calibration is done in the following steps, which are described in detail afterwards:
computation of additional point correspondences (Sect. 4.1), linear calibration of
all camera parameters (Sect. 4.2), maximum-likelihood estimation of the focal lengths
(Sect. 4.2), nonlinear refinement of extrinsic camera parameters (Sect. 4.3), and test of
the validity of the computed parameters (Sect. 4.4).

4.1 Computing Additional Point Correspondences 

Either six or seven 2-D/3-D point correspondences are established by the algorithm 
described in the previous section. While six points are enough for one of the methods 
described below, at least seven are needed for applying the algorithm of Tsai. 

Since we use an object of known shape we can compute one additional point correspondence
for each side where all four corners have been detected by using the intersection
of the two diagonals of a cube side. These additional points are easy to compute
in the image and in 3-D and are valid correspondences because the perspective projection
preserves intersections. Using projective geometry, the
intersection $q_s$ of the diagonals in the image plane can be computed by

$$ q_s = (q_1 \times q_3) \times (q_2 \times q_4), \qquad (1) $$

where $q_s$, $q_i$ ($i = 1, \ldots, 4$) are 3-vectors representing image points in homogeneous
coordinates, $q_1$ is opposite to $q_3$, and $q_2$ is opposite to $q_4$.
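Equation (1) translates directly into two kinds of cross products: the cross product of two homogeneous points is the line through them, and the cross product of two lines is their intersection. A minimal sketch, with illustrative function names and corner coordinates:

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def diagonal_intersection(q1, q2, q3, q4):
    """Intersection of the diagonals q1-q3 and q2-q4 in homogeneous coordinates."""
    diag13 = cross(q1, q3)  # line through the opposite corners q1 and q3
    diag24 = cross(q2, q4)  # line through the opposite corners q2 and q4
    return cross(diag13, diag24)


def to_pixel(q):
    """Convert a homogeneous image point to pixel coordinates (q[2] must be non-zero)."""
    return (q[0] / q[2], q[1] / q[2])


if __name__ == "__main__":
    # Corners of a cube side as seen in the image, listed around the quadrilateral.
    corners = [(0.0, 0.0, 1.0), (4.0, 0.0, 1.0), (3.0, 2.0, 1.0), (1.0, 2.0, 1.0)]
    print(to_pixel(diagonal_intersection(*corners)))
```

The corresponding 3-D point is the centre of the cube side, so each fully detected side yields one extra 2-D/3-D pair.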

4.2 Linear Calibration 

For calibration we assume a perspective camera model. A homogeneous 3-D point $w_i$
is projected onto a homogeneous 2-D point $q_i$ in the image plane using the following
equation:

$$ q_i = P w_i = K R^T (I_3 \mid -t)\, w_i, \qquad (2) $$

where $K$ is a 3 × 3 matrix containing the intrinsic parameters $f_x$, $f_y$, $u_0$, and $v_0$, $R$ is a
rotation matrix whose columns correspond to the axes of the camera coordinate system,
$t$ is a translation vector giving the position of the camera’s optical center, and $I_3$ is the
3 × 3 identity matrix.
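As a sketch of how the projection of equation (2) is evaluated, the following code applies the three factors from right to left. The intrinsic matrix is assumed here to have the usual upper triangular form with zero skew, and the numerical values are invented for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def project(K, R, t, w):
    """Project a homogeneous 3-D point w = (X, Y, Z, W) to pixel coordinates.

    Implements q = K R^T (I_3 | -t) w, where the columns of R are the camera
    axes and t is the position of the optical centre in world coordinates.
    """
    shifted = [w[i] - t[i] * w[3] for i in range(3)]  # (I_3 | -t) w
    cam = matvec(transpose(R), shifted)               # rotate into the camera frame
    q = matvec(K, cam)                                # apply the intrinsic parameters
    return (q[0] / q[2], q[1] / q[2])


if __name__ == "__main__":
    fx, fy, u0, v0 = 500.0, 500.0, 180.0, 144.0  # hypothetical intrinsics
    K = [[fx, 0.0, u0], [0.0, fy, v0], [0.0, 0.0, 1.0]]
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t = [0.0, 0.0, -5.0]  # camera placed 5 units behind the world origin
    print(project(K, R, t, [1.0, 0.5, 0.0, 1.0]))
```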

After the previous steps there are enough point correspondences to apply a linear
calibration method. For this purpose we use either the algorithm of Tsai
or a second algorithm that estimates the projection matrix.
Radial distortions are neglected, and the angle between the axes of the image reference
frame is assumed to be 90°. Tsai’s method assumes that the principal point is known
and computes the camera parameters directly, in contrast to the second method,
which estimates the projection matrix first and makes no assumptions about the principal
point. Both methods require non-coplanar point correspondences and the orthogonalization
of the resulting rotation matrix, which can be done by applying a singular value
decomposition (SVD).

Since $f_x$, $f_y$ are assumed to be constant over the whole sequence, we can get
maximum-likelihood estimates $\hat f_x$, $\hat f_y$ of these parameters at frame t (t = 1, 2, ...)
under the assumption of normally distributed, isotropic, and zero-mean noise by the
following recursive equation (given here for $f_x$ only):

$$ \hat f_{x,t} = \frac{1}{t+1}\left(t\,\hat f_{x,t-1} + f_{x,t}\right) = \hat f_{x,t-1} + \frac{1}{t+1}\left(f_{x,t} - \hat f_{x,t-1}\right), \qquad (3) $$

where $f_{x,t}$ is the result of the linear calibration at time step t. For t = 0 the initialization
$\hat f_{x,0} = f_{x,0}$ is used.
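The recursion is a running mean over the per-frame results of the linear calibration, so it needs to store only the previous estimate. A minimal sketch, assuming frames are counted from t = 0 as in the initialization above; the function name and the measurements are illustrative.

```python
def update_focal_estimate(prev_estimate, measurement, t):
    """One step of the recursive mean: t >= 1 is the index of the current frame,
    prev_estimate is the mean over frames 0 .. t-1."""
    return prev_estimate + (measurement - prev_estimate) / (t + 1)


if __name__ == "__main__":
    # Hypothetical focal lengths returned by the linear calibration, frame by frame.
    measurements = [500.0, 520.0, 490.0, 510.0]
    estimate = measurements[0]  # initialization at t = 0
    for t, f_xt in enumerate(measurements[1:], start=1):
        estimate = update_focal_estimate(estimate, f_xt, t)
    print(estimate)
```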



4.3 Nonlinear Refinement 



Since we use only slightly more points than the minimum number required for calibration,
nonlinear refinement of the camera parameters with the linear estimate as initialization
is absolutely essential. Optimization is done here for the extrinsic parameters only,
while the intrinsic parameters from the previous maximum-likelihood estimation are
held constant during nonlinear refinement. For this purpose the Gauss-Newton
algorithm with Levenberg-Marquardt extension is used, which
computes a new estimate of a parameter vector $a$ using a local parametrization $\Delta a$ by
$a_{k+1} = a_k + \Delta a$, where

$$ \Delta a = -(\lambda I + J^T J)^{-1} J^T e(a_k). \qquad (4) $$



This method minimizes the mean square error $e^T e$, where $e$ is a residual function that
computes in our case the (non-squared) back-projection error between each image point
$(x_i, y_i)$ and the projection $q_i = (q_{ix}, q_{iy}, q_{iw})^T$ of its corresponding 3-D point $w_i$
obtained by equation (2):

$$ e = \left( \frac{q_{1x}}{q_{1w}} - x_1,\; \frac{q_{1y}}{q_{1w}} - y_1,\; \ldots,\; \frac{q_{nx}}{q_{nw}} - x_n,\; \frac{q_{ny}}{q_{nw}} - y_n \right)^T. \qquad (5) $$



$J$ is the Jacobian of the first derivatives of $e$ evaluated at $a_k$: $J = \partial e / \partial a \,|_{a = a_k}$. Since the
matrix inversion in equation (4) may be numerically unstable due to a nearly singular matrix
$J^T J$, the factor $\lambda$ is introduced in the Levenberg-Marquardt algorithm and adapted
during each iteration. One Levenberg-Marquardt iteration comprises the following actions:
computation of a parameter update using equation (4) as well as the resulting
back-projection error; then either acceptance of the new parameters if the error is smaller than the
error after the last iteration, with division of $\lambda$ by a factor of 10, or rejection of the computed
parameters and multiplication of $\lambda$ by a factor of 10. Since the error may increase
during one iteration due to instabilities in matrix inversion, the preceding steps are repeated
until the new parameters yield a smaller error than at the end of the last iteration. The
parameter vector $a$ contains the 3 components of the translation $t$ plus 3 components
parametrizing the rotation matrix $R$, which has 9 elements but only 3 degrees of freedom
(DOF). A numerically stable parametrization should be used, i.e. either the axis/angle
representation or quaternions, which are both fair parametrizations of rotations,
while Euler angles are not. A rotation matrix can be represented by a 3-vector
$r = (r_1, r_2, r_3)^T$ giving the direction of the rotation axis with 2 DOF plus the
rotation angle $\theta$ encoded as the norm of $r$. The corresponding rotation matrix $R$ can be
calculated by Rodrigues’ formula:



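$$ R = I_3 + \frac{\sin\theta}{\theta}
\begin{pmatrix} 0 & -r_3 & r_2 \\ r_3 & 0 & -r_1 \\ -r_2 & r_1 & 0 \end{pmatrix}
+ \frac{1 - \cos\theta}{\theta^2}
\begin{pmatrix} 0 & -r_3 & r_2 \\ r_3 & 0 & -r_1 \\ -r_2 & r_1 & 0 \end{pmatrix}^2 \qquad (6) $$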
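Rodrigues’ formula can be evaluated in a few lines. The sketch below follows equation (6) term by term; the guard for very small angles is an added assumption to avoid division by zero and returns the identity matrix.

```python
import math


def matmul3(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rodrigues(r):
    """Rotation matrix for the axis/angle vector r (angle = norm of r)."""
    theta = math.sqrt(sum(c * c for c in r))
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    if theta < 1e-12:  # assumption: treat a vanishing angle as no rotation
        return identity
    r1, r2, r3 = r
    S = [[0.0, -r3, r2], [r3, 0.0, -r1], [-r2, r1, 0.0]]  # skew-symmetric matrix of r
    S2 = matmul3(S, S)
    a = math.sin(theta) / theta
    b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[identity[i][j] + a * S[i][j] + b * S2[i][j] for j in range(3)]
            for i in range(3)]


if __name__ == "__main__":
    # Rotation by 90 degrees about the z axis maps the x axis onto the y axis.
    for row in rodrigues((0.0, 0.0, math.pi / 2.0)):
        print(row)
```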



When using quaternions for nonlinear optimization it is necessary to consider that a
quaternion representing a rotation has 4 elements but only 3 DOF, since it must be
normalized to 1. The Levenberg-Marquardt algorithm cannot deal with constraints on
the parameters and it must be guaranteed that the norm of a quaternion is always 1
during optimization. In order to accomplish this goal a quaternion parametrization at the
operating point using only 3 elements was introduced in [13,14].



4.4 Detection and Treatment of Outliers 

For different reasons the virtual object in the resulting augmented image may be rendered
in a completely different pose than the cube. Most of the time this is not due to
calibration errors or badly localized cube corners, but to misclassification of the colors
painted on the sides of the cube. Additionally there are images where no cube could be
found at all. Both cases lead to visually unacceptable results and hence have to be dealt
with in an appropriate way. For detection of a non-valid pose, we use thresholds for
change in rotation (measured in distance between vectors in axis/angle representation)
and translation with respect to the last valid frame.
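The validity test can be sketched as two threshold comparisons against the last valid pose. The threshold values and the function names below are hypothetical; the paper does not state the values it uses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_valid_pose(r, t, last_r, last_t, max_rot_change, max_trans_change):
    """Accept a pose only if rotation and translation changed moderately.

    r and last_r are axis/angle vectors, t and last_t translation vectors.
    """
    rot_change = euclidean(r, last_r)
    trans_change = euclidean(t, last_t)
    return rot_change <= max_rot_change and trans_change <= max_trans_change


if __name__ == "__main__":
    last_r, last_t = (0.0, 0.1, 0.0), (0.0, 0.0, 50.0)
    # A small movement between frames is accepted ...
    print(is_valid_pose((0.0, 0.15, 0.0), (1.0, 0.0, 50.0), last_r, last_t, 0.5, 10.0))
    # ... while a sudden flip of the cube, typical for swapped colors, is rejected.
    print(is_valid_pose((0.0, 1.7, 0.0), (1.0, 0.0, 50.0), last_r, last_t, 0.5, 10.0))
```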

The easiest solution is to keep the pose of the last valid frame. This is acceptable
when only a few frames have to be dropped. Otherwise a prediction of the cube’s movement
would yield a better result for the human observer. For this purpose we use linear
prediction, a technique classically applied in speech recognition [9]. Linear prediction
estimates the n-th value of a sequence of discrete values $g_i$ using a combination of the
preceding k values as follows:

$$ \hat g_n = \sum_{i=1}^{k} \alpha_i\, g_{n-i}. \qquad (7) $$

Having a long enough sequence, one can estimate the unknowns $\alpha_i$ from a linear system
of equations using e.g. the SVD, which minimizes the mean square error. The $\alpha_i$ are in
turn used for predicting new values of the sequence. In our case we predict the elements
of the translation vector and of the rotation matrix in axis/angle representation.
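A minimal sketch of linear prediction for one scalar sequence, for example one component of the translation vector. To stay within the standard library it solves the least squares problem through the normal equations with Gaussian elimination instead of the SVD mentioned above, which gives the same minimizer when the system is well conditioned. All names and the sample sequence are illustrative.

```python
def solve(A, b):
    """Solve a small square linear system by Gaussian elimination with pivoting.

    Assumes the system is non-singular.
    """
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def fit_coefficients(g, k):
    """Least squares estimate of alpha_1..alpha_k from the sequence g."""
    rows = [[g[n - i] for i in range(1, k + 1)] for n in range(k, len(g))]
    targets = [g[n] for n in range(k, len(g))]
    AtA = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    Atb = [sum(row[i] * y for row, y in zip(rows, targets)) for i in range(k)]
    return solve(AtA, Atb)


def predict_next(g, alpha):
    """Predict the next value of g as sum_i alpha_i * g[n - i], as in equation (7)."""
    return sum(alpha[i - 1] * g[-i] for i in range(1, len(alpha) + 1))


if __name__ == "__main__":
    # A cube moving at constant speed along one axis (hypothetical positions).
    positions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    alpha = fit_coefficients(positions, k=2)
    print(alpha, predict_next(positions, alpha))
```

For uniform motion the fitted coefficients reproduce the constant-velocity extrapolation, so a dropped frame is replaced by the position the cube would have reached at its previous speed.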



5 Experiments 

In our experiments we used images of size 360 × 288 pixels. Two results of augmentation
are shown in Fig. 2. More images and video sequences are also available.

Fig. 2. Two examples of augmentation: Original (top left), CSC segmented cut-out (top middle),
and augmented image (top right) on a turntable; original (bottom left) and augmented image
(bottom right) of cube held in hand.

We found that the part most prone to errors is the color classification due to different
lighting conditions. Misclassification of a cube side leads to wrong 2-D/3-D assignment
and thus to unusable calibration results. The color classifier was trained using 577 to
885 samples per color. The recognition rate for the whole cube ranges from 74% to 93%,
depending on the illumination of the scene. The mean detection accuracy (measured
by hand) of the cube’s corners is about 1.3 pixels. If Tsai’s calibration method is to be
used, additional point correspondences must be computed when only six points have
been detected, which leads to fairly good visual results. When using the other method,
simulations showed that calibration can be done accurately enough even with only six
point correspondences, while computation of additional points yields a worse back-projection
error and worse camera parameters.

The system is still working off-line, but since we have a real-time system in mind
we want to give an impression of the computation times needed for one frame on an
800 MHz Pentium III. Using the Windows program of [12], the CSC took 40 to 130
msec for two passes, one on the reduced-size image and one on the cube cut-out, the
given time depending on the number of possible cube regions found in the reduced-size
image. An additional 150 to 250 msec are needed for color classification of the segmented
regions and computing the corners. The time needed for calibration is dominated by nonlinear
optimization; depending on the number of Levenberg-Marquardt iterations done,
computation time varied from 10 msec (5 iterations) to about 30 msec (20 iterations).
OpenGL rendering takes an additional 30 msec.



6 Conclusion and Future Work 

In this contribution we present an Augmented Reality system using a color segmentation
approach for localizing a metal cube in an image which can be replaced by an arbitrary
computer-generated object. Camera calibration with a minimum number of point correspondences
works well and is fast enough for real-time applications even when nonlinear
refinement is done. Topics for further improvement are the speed and accuracy of cube
localization and corner detection. We also want to consider illumination estimation using
information on the cube’s colors in the future.



Abstract. In this work we describe a set of visual routines, which support a
novel sensor-free interface between a human and virtual objects. The visual
routines detect, track and interpret a pointing gesture in real time. This is
solved in the context of a scenario which enables a user to activate virtual
objects displayed on a projection screen. By changing the direction of pointing
with an arm extended towards the screen, the user controls the motion of virtual
objects. The vision system consists of a single overhead view camera and 
exploits a priori knowledge of the human body appearance, interactive context 
and environment. The system operates in real time on a standard Pentium-PC 
platform. 



1. Introduction 

At the beginning of the 1990s there was an explosion in creating physically interactive
environments where users could explore a virtual environment or interact with a 
character. A key task for vision systems supporting such interaction is to detect and 
interpret human actions as they appear in imaging data. In this paper we focus on the 
visual analysis of a gesture of pointing. 

A number of systems have been developed that use interactive gesture recognition. 
These are relevant for the present work in that the system has to recognise that the 
gesture occurred and extract a parameter important to the interaction. The ALIVE [4] 
and Perseus [10] systems are examples. The approach of these systems is first to 
identify the static configurations of the user’s body that are diagnostic of the gesture, 
and then use a separate method to extract the gesture parameters. In the present work 
we refer to the notion of parameterised movements [15], meaning the movements that 
exhibit meaningful, systematic variation. 

The ability to follow people’s action in real time is an important component of all 
these systems. Progress in tracking of human movements was reported in [1], [3], [8], 
[14] for various conditions, such as static background, periodic motion, stereovision, 
and gesture-initialisation. The main principle of all these schemes, which is also used 
in the present work, is that the body is segmented into parts, which are tracked 
independently. The visual tracking often exploits the knowledge of the context of the 
specific tracking task [13] or enforces the kinematics constraint on the body motion 
[7]. Statistical models associated with different body parts are often used for their 
tracking [16]. In this work we use a closed-world assumption [8]: a closed world is a 
region of space and time in which the specific context of what is happening is 
assumed to be known. 

The vision system presented here supports a specific interaction between a person 
and virtual objects. In the scenario a person observes a 3-D virtual environment 
projected onto the screen, selects a particular object by pointing at it with an extended 
arm and controls its motion by changing the direction of pointing. Our philosophy has 
been to combine complementary and inexpensive vision modules into an 
integrated system that is likely to perform well under the given context and 
environment constraints. 

2. Visually Driven Interface and the Environment 

One of the most frequently used and expressively powerful gestures is pointing. In the 
scenario a user can move and rotate an object across the screen. The object 
follows the motion of the arm extended toward the screen as it points in different 
directions. We are only interested in the horizontal coordinate of pointing, as virtual 
objects can move and rotate either to the left or to the right depending on the side of 
the screen the user points at. This consideration allows us to use a single overhead-view 
greyscale camera for the vision system supporting the interaction. The pointing 
gesture, viewed by the overhead camera, generates salient visual cues, which are easy 
to detect and interpret. 

In the system the user moves freely in a real-world space of approximately 6 by 3 
meters. A virtual scene is projected on a wall area 3 m wide and 2 m high. A 
monochrome CCD camera, placed at the ceiling of the space, captures the top-view 
image of the user at the standard video rate of 25 frames per second. The 
pointing gesture is defined to occur when the user extends either arm towards the 
screen. Due to the context and space geometry, the direction of pointing is restricted to 
one hemisphere in front of the user and is parameterised relative to the projection 
screen. Pointing is modelled by a 3-D line connecting a middle point on the user’s 
head and the tip of the pointing hand. Reconstruction of the pointing gesture is then 
equivalent to finding the horizontal and the vertical coordinates of the point where this 
line meets the screen. In this work we find only the horizontal coordinate that 
parameterises pointing. 



3. Segmentation of a Human Body 

The first task to be solved by the system is segmentation of the user in the camera 
image. This task is normally difficult, but the static room environment and the fixed 
camera in our case allow the use of background subtraction. However, due to 
non-uniform lighting, a shadow cast by the user on the floor may cause problems. 
Previous attempts at segmenting people from a known background have taken 
advantage either of colour cameras [2] or of multiple cameras [9]. As we have a single 
greyscale camera, segmentation of the user is done by background subtraction, 
followed by morphological operations to remove noise. 

Background subtraction. We have observed that, for robust segmentation of a human 
body, a dark floor has an advantage over a bright one. Firstly, it 
suppresses the user’s shadow; secondly, it significantly diminishes the random brightness 
variations of those pixels that contribute to isolated noise in the segmented image. 
Segmentation tends to be more stable if the threshold values applied to the difference are 
attuned to the brightness variations over time at each image pixel. A threshold array of the 
size of the image is therefore computed off-line. Typical brightness variations are observed at 
each pixel during live image acquisition, and the threshold values are set to the maximum 
variation recorded at each pixel during about 15 minutes of acquisition. The segmentation 
process starts by acquiring the static background of the room. Smoothing this 
background image with the averaging operator in a local 3x3-pixel window generates 
a reference image. The same local smoothing is applied to all frames. The difference 
between the reference image and each smoothed image from the frame sequence is 
computed. The final segmentation is obtained by thresholding the difference using the 
threshold array. 
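
The background-subtraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: images are plain lists of rows, the function names `smooth3x3` and `segment` are hypothetical, the averaging window is clipped at the image border, and the absolute value of the difference is compared with the threshold (the text does not say whether the difference is signed).

```python
def smooth3x3(img):
    """Average each pixel over its 3x3 neighbourhood (window clipped at the borders)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


def segment(frame, reference, thresholds):
    """Binary mask: 1 where |smoothed frame - reference| exceeds the per-pixel threshold."""
    smoothed = smooth3x3(frame)
    return [[1 if abs(s - r) > t else 0 for s, r, t in zip(srow, rrow, trow)]
            for srow, rrow, trow in zip(smoothed, reference, thresholds)]


# Toy example: a uniform dark background and a bright 3x3 blob standing in for the user.
background = [[10] * 7 for _ in range(7)]
reference = smooth3x3(background)            # reference image, computed once
thresholds = [[4.0] * 7 for _ in range(7)]   # assumed per-pixel thresholds (normally measured off-line)
frame = [row[:] for row in background]
for y in range(2, 5):
    for x in range(2, 5):
        frame[y][x] = 90
mask = segment(frame, reference, thresholds)
```

Because every frame is smoothed before the comparison, the resulting mask is slightly larger than the blob itself; the morphological analysis that follows operates on this mask.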

Morphological analysis. Human body segmentation is followed by a morphological 
analysis to ensure that all noisy areas are removed and only the body segment 
remains in the segmented image. The morphological operator is applied to the binary 
segmented image in order to extract all of its connected components. Each component 
is then labelled and its size is recorded. The final segmentation is obtained by eliminating 
all but the largest connected component, which we consider to be the human body 
segment. The extraction of connected components in the binary image 
consists of two steps. The first step, run-length encoding of the binary image, 
transforms the binary pixel values of the image into a list of encoded line segments 
that contain only non-zero pixels. The second step, neighbourhood analysis of 
line segments, uses the list to identify all connected components in the binary image and 
labels each of them with a unique identification number. The algorithm has O(n) complexity, 
where n is the number of encoded line segments. This method gives a reasonably stable 
segmentation of a person without shadow. 
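
The two-step component extraction can be sketched as below. This is an illustrative version under stated assumptions: 8-connectivity is assumed (the text does not specify the neighbourhood), the helper names are hypothetical, and for brevity each run is compared with all runs of the previous row, so this sketch does not reach the O(n) bound quoted above; a sweep with two pointers over adjacent rows would be needed for that.

```python
def run_length_encode(mask):
    """Step 1: list of runs (row, x_start, x_end) covering the non-zero pixels."""
    runs = []
    for y, row in enumerate(mask):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((y, start, x - 1))
            else:
                x += 1
    return runs


def label_runs(runs):
    """Step 2: merge runs of adjacent rows that touch (assumed 8-connectivity)."""
    parent = list(range(len(runs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    by_row = {}
    for idx, (y, _, _) in enumerate(runs):
        by_row.setdefault(y, []).append(idx)
    for idx, (y, xs, xe) in enumerate(runs):
        for j in by_row.get(y - 1, []):
            _, pxs, pxe = runs[j]
            if pxs <= xe + 1 and pxe >= xs - 1:   # overlap or diagonal contact
                parent[find(idx)] = find(j)
    return [find(i) for i in range(len(runs))]


def largest_component(mask):
    """Keep only the largest connected component of a binary mask."""
    runs = run_length_encode(mask)
    labels = label_runs(runs)
    sizes = {}
    for (_, xs, xe), lab in zip(runs, labels):
        sizes[lab] = sizes.get(lab, 0) + (xe - xs + 1)
    out = [[0] * len(row) for row in mask]
    if not sizes:
        return out
    keep = max(sizes, key=sizes.get)
    for (y, xs, xe), lab in zip(runs, labels):
        if lab == keep:
            for x in range(xs, xe + 1):
                out[y][x] = 1
    return out
```

Working on runs rather than pixels keeps the list short for typical body silhouettes, which is presumably why the encoding step comes first.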

4. Modelling the Gesture of Pointing 

Once the body segment has been isolated from the background, we label critical body 
parts. We use explicit a priori knowledge of the appearance of a human body in the overhead 
view image, and also the given context and room geometry. As we know on which side 
of the image the screen appears, the pointing arm is restricted to a certain angular 
sector centred at the user’s location. We proceed by drawing a bounding box around the 
isolated body segment. The overhead view of the body segment in the box has specific 
features: a) a prominent head crowns the shoulders while the rest of the body 
is almost occluded; b) this typical body shape changes dramatically if the user 
extends one of his arms towards the screen: the body size roughly doubles in the 
direction of the screen. We use the size of the bounding box to classify the current 
activity of the user, which is either “non-pointing” or “pointing”. If the box increases 
by more than 50% relative to its original size, the gesture of pointing is detected and 
the human body is segmented into two parts: 1) the arm/hand segment, and 2) the 
head/shoulders segment. Next, each segment is enclosed in a separate bounding 
box. The pointing gesture is modelled by a line of sight connecting the middle pixel on 
the user’s head and the tip of the pointing hand. To meet the real-time 
processing requirements, we compute the centre of gravity of the head/shoulders segment 
to approximate the most probable location of the middle pixel on the head. The above 
process is illustrated in Fig. 1. 
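
A minimal sketch of the bounding-box test and of the centre-of-gravity approximation follows. It assumes, purely for illustration, that the screen lies in the direction of the image x axis, so that the box extent along x is the quantity that grows; the function names and the `growth` parameter are hypothetical, and the split of the body into arm and head/shoulders segments is not shown.

```python
def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the non-zero pixels, or None if empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def is_pointing(mask, rest_extent, growth=0.5):
    """True if the box extent towards the screen exceeds the resting extent by more than 50%."""
    box = bounding_box(mask)
    if box is None:
        return False
    extent = box[2] - box[0] + 1      # extent along x (assumed screen direction)
    return extent > (1.0 + growth) * rest_extent


def centre_of_gravity(mask):
    """Mean (x, y) of the non-zero pixels; used here to approximate the middle of the head."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n


# Toy example: a 3x3 body blob at rest, then the same blob with an arm stretched along x.
rest = [[1, 1, 1, 0, 0, 0, 0, 0] for _ in range(3)]
pointing = [row[:] for row in rest]
pointing[1] = [1, 1, 1, 1, 1, 1, 1, 0]
rest_box = bounding_box(rest)
rest_extent = rest_box[2] - rest_box[0] + 1
```

Using a relative rather than an absolute increase makes the test independent of the size of the user's body.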




Fig. 1. Classification of the body parts and construction of the line of sight in the image. 

5. Tracking of the Gesture of Pointing 

Real-time interaction ought to trigger immediate modifications in the virtual scene in 
response to motion of the gesturing arm. We use tracking to follow the gesture of 
pointing over time. We track the two points of interest that define the line of sight, i.e. 
the middle pixel of the head and the tip of the hand used for pointing. It is faster to 
track these two points in the image sequence than to segment each frame. We extend the 
Hierarchical Feature Vector Matching (HFVM) [11] algorithm by computing an 
object-associated feature vector of a pixel. A feature is a real number that 
numerically describes statistical information about the neighbourhood of the pixel location. 
Each object in the image has a data structure that characterises its estimated size, 
shape, directional intensity distribution, shape of edges, etc. We collect this statistical 
data into the object feature vector of the pixel associated with this object. These 
feature vectors are then compared during the matching process. Here we exploit the 
closed-world assumption, i.e. no new objects are expected to appear in the search area. 
We match feature vectors in images at the original resolution within a rather restricted search 
area. If the search area is too small for a successful match of the head or the hand 
object, the vision system aborts the tracking mode and returns to full-frame 
segmentation. 



6. Reconstruction of the Direction of Pointing 

3-D interpretation of the pointing gesture is based on the knowledge of the internal 
camera parameters and room geometry. Below we describe off-line calibration of the 
vision system followed by real time reconstruction of the user’s pointing. 

Off-line calibration of the vision system. There are two independent steps in the 
calibration of the vision system: 1) obtaining the internal camera parameters and 2) 
finding the camera location and orientation (camera pose) relative to the screen. We use a 
pinhole camera model, which is described by the perspective projection matrix P in 
the normalised coordinate system of the camera (see [5], page 57). We use the DLT 
calibration method [5] to find the internal camera parameters that define the matrix P. 
The camera pose is defined by 6 parameters: 3 for rotation and 3 for translation. As we 
do not explicitly need all 3 angular values of the camera rotation, we only find the optical 
ray that is perpendicular to the room’s floor. To do that, a mirror is placed on the floor in 
such a way that the camera sees its own reflection in the mirror. If 
pixel v (Fig. 2) is the middle pixel of the camera in its reflected image, then the optical 
ray <Cv> spanned by the camera centre C and pixel v is the optical ray perpendicular 
to the floor. 

Camera translation is defined relative to room geometry. The following parameters are 
either measured or computed off-line (Fig. 2): 

1. The altitude |CV| of the camera above the floor. 

2. The image line l, which is the image of the 3-D line of intersection between the 
screen and the floor. Note that l can extend beyond the image boundaries. 

3. The pixel w, which is the image of the 3-D point W such that a) it belongs to the 
lower screen boundary, and b) the line segment <WV> on the floor is perpendicular to 
the screen. 

4. The distance |WV| between the points V and W on the floor. 




Fig. 2. 3-D model of the pointing gesture. The double line in the image illustrates the line of 
sight. Pixel m is the point selected by the user on the screen. 

3-D interpretation of the direction of pointing. To determine the horizontal screen 
coordinate that parameterises the direction of pointing, a two-step algorithm is applied. 
First, the pointing line in 2-D is constructed. In the second, 3-D step we compute both 
the perspective projection of the pointing line onto the room floor and its intersection 
with the screen. These two steps are also illustrated in Fig. 2: 

2-D step. The direction of pointing is modelled by the line of sight (Section 4). The 
intersection point m of the line of sight and the line l is found. 

3-D step. The perspective projections of the image points v and m onto the floor are 
computed. These points are denoted by the capital letters V and M, respectively. The 
end point of the vector $\mathbf{V}$ on the optical ray <Cv> in the normalised coordinate system 
is given by $\mathbf{V} = \lambda \tilde{P}^{-1} \tilde{\mathbf{v}}$. Here $\tilde{P}$ is the 3x3 leftmost sub-matrix of P, 
$\tilde{\mathbf{v}} = [\mathbf{v}^T, 1]^T$ is the homogeneous vector of v, and $\lambda$ is a scalar. Given a similar 
expression for the vector $\mathbf{M}$, the angle $\gamma$ between the optical rays <Cv> and <Cm> is 
given by $\cos\gamma = \mathbf{M} \cdot \mathbf{V} / (|\mathbf{M}|\,|\mathbf{V}|)$. As $\gamma < 180^\circ$, this definition is unique. The distance 
between the points V and M on the floor is obtained from the right triangle CVM and 
the measured altitude |CV|. Finally, the direction of pointing is defined by the distance 
|WM| computed from the right triangle VWM. 
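
The 3-D step reduces to two right triangles, which the following sketch works through. It assumes the two optical rays are already available as 3-D direction vectors (for instance from the back-projection of v and m), that the right angles are at V in triangle CVM and at W in triangle VWM, and that the angle between the rays is below 90 degrees so that the ray through m actually meets the floor. The function names are hypothetical, and the sketch returns an unsigned distance only; deciding on which side of W the point M lies is outside its scope.

```python
import math


def angle_between(u, v):
    """Angle (radians) between two 3-D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_gamma = max(-1.0, min(1.0, dot / (norm_u * norm_v)))   # clamp rounding errors
    return math.acos(cos_gamma)


def pointing_distance(ray_v, ray_m, cam_height, dist_wv):
    """Distance |WM| along the screen base line from the measured |CV| and |WV|."""
    gamma = angle_between(ray_v, ray_m)
    if gamma >= math.pi / 2:
        raise ValueError("ray through m does not meet the floor")
    dist_vm = cam_height * math.tan(gamma)       # triangle CVM, right angle at V
    squared = dist_vm ** 2 - dist_wv ** 2        # triangle VWM, right angle at W
    if squared < -1e-9:
        raise ValueError("M is closer to V than the screen is")
    return math.sqrt(max(0.0, squared))


# Toy geometry: camera 3 m above V, screen base 3 m from V, M located 4 m from W.
wm = pointing_distance((0.0, 0.0, 1.0), (4.0, 3.0, 3.0), cam_height=3.0, dist_wv=3.0)
```

In this toy geometry |VM| is 5 m, so the sketch returns a value of about 4 m for |WM|, which matches the 3-4-5 triangle on the floor.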

7. Experiments 

Many people have tested the interactive interface by pointing at virtual objects from 
different locations. The distance to the screen varied within the range of about 1 m to 
5 m. Before entering the room, subjects were only told that pointing with either arm 
activates manipulation of virtual objects. We have observed that it takes about 1-2 
minutes before a naive user becomes accustomed to the interface. The visual routines 
performing detection, tracking and interpretation of pointing run on a Pentium 
processor under the GNU/Linux operating system. If images are digitised at a size of 
384x288 pixels, the system is able to follow events at a rate of 15 Hz. 

Typically, background subtraction resulted in a binary image with about 20-30 
scattered segments with an average size of about 15 pixels. After all smaller segments 
were discarded, we obtained a stable segmentation of the user’s body. Failures, such as 
fragmented body segmentation, occurred for those users whose dark clothes 
looked similar to the floor background. Standard random noise did not disrupt the final 
segmentation because it has little effect on the shape of the biggest segment. We 
observed stable tracking both for the tip of the hand and for the middle point of the 
head. The search area used for tracking the head object was 3 times smaller than the 
one used for tracking the hand tip. However, tracking of the head was lost in those 
cases when the user rotated the head noticeably. This is because HFVM 
matching is not invariant to rotations above 15 degrees. Tracking of the tip of the 
hand was stable as long as the user kept his arm pointing. 




Fig. 3. Accuracy of reconstruction of pointing for the three areas in the camera image. Area 2 
lies at about 1 m from the screen. Area 3 extends over a distance range of about 3 m to 6 m from 
the screen. The best accuracy of reconstruction, about 4 cm (1% of the screen size), is achieved for 
pointing from the central area 1. The accuracy decreases for the peripheral areas 2 and 3 due to 
poor estimation of the middle pixel on the head. 

A vertical strip was projected onto the screen at the position defined by the reconstructed 
horizontal coordinate. The accuracy of interpretation of pointing depends on the position 
of the user in the camera image (Fig. 3). The main source of inaccurate reconstruction 
is unstable modelling of the line of sight, which occurs when the head 
occludes either of the user’s shoulders in the image. Table 1 presents results on system 
performance and Fig. 4 exemplifies the process of interaction. 

Table 1. Performance of the system recorded over time in experiments with 12 subjects. 
Segmentation failed 13% of the time. Given successful segmentation, failures in modelling 
the line of sight were recorded 3% of the time for area 1 and 7% of the time for the 
peripheral areas 2 and 3. We recorded a failure if the line deviated from the correct direction by 
more than 5 degrees. Tracking failures were attributed only to head rotation. The overall 
performance of reconstruction, shown in the last row, was mainly affected by unstable 
segmentation. 





                Area 1   Area 2   Area 3
Segmentation     87%      87%      87%
Line of sight    97%      93%      93%
Tracking         96%      95%      96%
Overall          85%      83%      83%




Fig. 4. Three frames from the video sequence illustrating pointing-driven rotation of the virtual 
printer. The vertical stripe indicates the horizontal position pointed at on the screen. 



8. Conclusions and Future Work 

We have presented a set of visual routines for perceiving the gesture of pointing in a 
novel sensor-free interface between a human and virtual objects. A user can move 
freely in the real space and activate motion of virtual objects by pointing towards the 
screen with an extended arm. The visual routines perform detection, tracking and 
reconstruction of the horizontal direction of pointing in real time. The vision system 
consists of an overhead-view greyscale camera. Camera calibration is obtained off-line 
by classical means using a 3-D calibration target. 

Our experiments have shown that the interactive environment gives its users an 
easy insight into how to activate and control virtual objects. Detection of pointing 
operates well regardless of the size of the user’s body, because we use the relative 
increase in the size of the bounding box as a sign that pointing has occurred. Pointing is 
detected successfully in cases where it is performed with a half-bent arm. The accuracy of 
reconstruction of the horizontal location on the screen is sufficient for comfortable 
interaction. 

A major limitation of the current implementation is that segmentation of the user’s body 
from the background is sensitive to the colour of clothes and hair. This weakness 
results from intensity-based segmentation of greyscale images. Another limitation 
arises from the simplistic computation of the middle pixel on the user’s head, which leads 
to somewhat unstable modelling of the line of sight in the peripheral areas of the 
camera image. Both limitations are the price paid for the simplicity of the setup of the 
vision system, which includes only one monochrome camera. In future work we will 
use a stereo technique to reconstruct both the horizontal and the vertical coordinates of 
the direction of pointing. This will also improve the robustness of human body 
segmentation and the modelling of the line of sight. 
Abstract. In this paper, we present a technique for the extraction of 
the five main visemes for German spoken language analysis from images. 
The intensity, the edges, and the line segments are used to locate the lips 
automatically and to discriminate between the desired viseme classes. 
A good recognition rate has been achieved for different speakers. 



1 Introduction 

Automated speech perception systems are very sensitive to background noise. 
They fail totally when multiple speakers are talking simultaneously (cocktail 
party effect). Besides the acoustic signals from both ears, visual information, 
mostly lip movements, is subconsciously involved in the recognition process of 
human beings. 

Tracking of the lips in image sequences and relating the features of the lips to 
the speech are two challenging problems. This is due to the lack of dominant 
image features defining the lips. For lip location and visual feature extraction, 
several methods have been proposed in the literature. Deformable templates 
are effective in lip tracking, though quite computationally expensive. 
Principal component analysis (PCA) has been used [3]. But not every lip shape can be 
described by PCA, and a large database of objects is needed. A 3D model 
of the lips using two calibrated cameras has been proposed. However, human 
beings can read speech even from still 2D images. 

Here, we present a method for lip tracking and viseme classification using 
feature template matching. Given a number of independent features of the 
sample data, a linear combination of these is created which yields the largest mean 
difference between the desired classes. For a certain speaker, the system is 
trained with five different visemes. The different features of the visemes are 
stored in templates. These templates are used to track the lips and to recognize 
the viseme. 

The paper is organized as follows. In sec. 2 we describe how the phonemes 
are related to mouth shape, and how they are divided into classes of visemes 
similar to the perception of human beings. In sec. 3 the use of the feature 
templates in locating the lips in the face of the speaker is illustrated. In sec. 4 
the classification of the features into the viseme classes is illustrated. In sec. 5 a 
comparison between different sequences and the results are presented. 



2 Grouping Phonemes into Visemes 

Human beings can relate the lip shape to groups of phonemes. This can be 
noticed especially when the lips are mainly used to produce a certain phoneme. 
For example, for the phonemes m, p, b the mouth must be closed. 

Dodd [7] divided the sounds or phonemes into 14 groups of visemes for 
English. A reasonable grouping was proposed by Abry: the phonemes in French 
were subdivided into six groups according to the lips’ shapes. We subdivided the 
phonemes into five groups depending on the lip shape. These are 

1. C: closed lips m, b, p, ... 

2. O: open lips a, h, ... 

3. H: half open lips t, d, k, g, s, ... 

4. L: long lips i, e, ... 

5. R: rounded lips o, u, ü, ö, ... 

We noticed that this is not always the case for all the speakers. The boundaries 
between the visemes are not always clearly defined for each speaker. Thus, the 
number of viseme classes can be even less than 5. Fig. 1 shows an example 
of five distinguishable viseme classes. 




Fig. 1. An example of images of the five viseme classes (in order: closed, open, half open, long, rounded). 



3 Tracking of the Lips 

A pattern image representing the lips is correlated with the speaker’s image pixel 
by pixel, see Fig. 2. The speaker’s image and the pattern image are scaled down 
to speed up the search. 

The areas where the correlation is greater than a certain threshold are 
marked. Due to scale and rotation in the lip image and the variable lighting 
conditions, the position of the absolute maximum correlation is not always 
the correct position of the lips. The eyes sometimes have correlation coefficients 
larger than that of the lips. 

To avoid getting stuck at non-lip locations, we inspect additional features 
of the lips. The edge image is calculated and fitted with 10-pixel line segments. 
Distinguishing features of the lips are the corners of the mouth. Along the line 
segments, we search for the mouth-corner patterns in the same way as 
for the lip pattern. Using these two corners, we rotate and scale the image 
of the lips into the standard position for further viseme recognition. Fig. 3 
illustrates this procedure. In sec. 4 further refinement will be described. 
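
The rotation and scaling into standard position can be sketched as a similarity transform defined by the two mouth corners. The sketch below is an assumption-laden illustration: the text does not define the standard position, so here it is taken to mean that the left corner maps to the origin and the right corner to (`target_width`, 0); the function name and the parameter `target_width` are hypothetical.

```python
import math


def standard_position_transform(left, right, target_width=100.0):
    """Return a function mapping image points so that `left` -> (0, 0)
    and `right` -> (target_width, 0), using rotation and uniform scaling."""
    dx, dy = right[0] - left[0], right[1] - left[1]
    distance = math.hypot(dx, dy)
    if distance == 0:
        raise ValueError("mouth corners coincide")
    angle = math.atan2(dy, dx)              # current inclination of the mouth line
    scale = target_width / distance
    cos_a = math.cos(-angle) * scale        # rotate by -angle, then scale
    sin_a = math.sin(-angle) * scale

    def transform(point):
        x, y = point[0] - left[0], point[1] - left[1]
        return (cos_a * x - sin_a * y, sin_a * x + cos_a * y)

    return transform


# Toy example: corners 50 pixels apart on an inclined line.
to_standard = standard_position_transform((10.0, 20.0), (40.0, 60.0))
mapped_right = to_standard((40.0, 60.0))
mapped_mid = to_standard((25.0, 40.0))
```

Applying the same mapping to every pixel of the lip region (with interpolation, not shown) would give the normalised image that is compared with the viseme templates.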

The region of interest (ROI) is defined as 150% of the lip size at the current 
lip location. The process is restricted to the ROI if the correlation coefficient 
$\rho$ is less than a certain threshold value at least every 20 images. 




Fig. 2. Searching for the lips: left, the search pattern; middle, the matched pattern; 
right, regions where $\rho$ > threshold. 




Fig. 3. Searching for the corners of the mouth in the ROI, and putting the lips in standard 
position. Panels, in order: lip image, edges, line segments, corner fitting, rotation, standard position. 



4 Viseme Recognition 

Different visemes are very close to each other in feature space. Only very small 
variations exist between them. This makes the recognition a non-trivial task. 






440 I. Shdaifat, R.-R. Grigat, and S. Lütgert 



Fig. 4 shows a diagram of the five viseme classes and the large overlap between 
them. 

In sec. 3 the image of the lips was detected and transformed into the standard 
position. The resulting image of the lips is compared with each viseme template. 
Fig. 5 shows an image of closed lips and the basic features that are used in the 
recognition. 





Fig. 4. A hypothetical representation of the overlap between the different viseme 
classes. 

Fig. 5. Features used to discriminate between the visemes. 



For intensity comparison the intensity correlation $\rho_i$ is used. Given two 
images $I_1$ and $I_2$ with $N$ pixels each, let $\bar{I}_1$, $\bar{I}_2$ and $\sigma_1$, $\sigma_2$ be the averages 
and the standard deviations of $I_1$ and $I_2$, respectively; then 

$$\rho_i = \frac{1}{N \sigma_1 \sigma_2} \sum_{x} \bigl(I_1(x) - \bar{I}_1\bigr)\bigl(I_2(x) - \bar{I}_2\bigr).$$

For the edges, $\rho_e$: 

    for each edge point in image I_1 do 
        if edge point exists in w_e x w_e window in I_2 then 
            counter = counter + 1 
        end if 
    end for 

$$\rho_e = \frac{\text{counter}}{N_{\text{edge points of } I_1}}$$

For the lines, $\rho_l$: 

    for each line segment in image I_1 do 
        if line segment exists in w_l x w_l window in I_2 then 
            if angle of line segment in I_1 and in I_2 are equal then 
                counter = counter + 1 
            end if 
        end if 
    end for 

$$\rho_l = \frac{\text{counter}}{N_{\text{line segments of } I_1}}$$

The total recognition rate $\rho_t$ is the linear combination of $\rho_i$, $\rho_e$, and $\rho_l$. 
$\rho_t$ is calculated for each viseme template. The output viseme is the one with 
maximum $\rho_t$. If $\rho_t$ is less than a certain threshold value, the search for the lips is 
repeated for the whole image. If $\rho_t$ is again below the threshold, the 
classification is rejected. 
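
The scoring scheme above can be sketched as follows for the intensity and edge features; the line-segment score would follow the same counting pattern with an additional angle comparison. Several details are assumptions made only for this illustration: images are flat lists of grey levels, edge points are (x, y) tuples, the combination weights are equal, and the rejection threshold is 0.5. None of these values is given in the text, and the function names are hypothetical.

```python
import math


def intensity_correlation(img1, img2):
    """Normalised correlation of two equally sized grey-level images (flat lists)."""
    n = len(img1)
    mean1, mean2 = sum(img1) / n, sum(img2) / n
    std1 = math.sqrt(sum((a - mean1) ** 2 for a in img1) / n)
    std2 = math.sqrt(sum((b - mean2) ** 2 for b in img2) / n)
    if std1 == 0 or std2 == 0:
        return 0.0                       # constant image: correlation undefined, treated as 0
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(img1, img2))
    return cov / (n * std1 * std2)


def edge_score(edges1, edges2, window=3):
    """Fraction of edge points of image 1 with an edge point of image 2
    inside a window x window neighbourhood."""
    if not edges1:
        return 0.0
    half = window // 2
    lookup = set(edges2)
    counter = 0
    for x, y in edges1:
        if any((x + dx, y + dy) in lookup
               for dx in range(-half, half + 1)
               for dy in range(-half, half + 1)):
            counter += 1
    return counter / len(edges1)


def classify(scores_by_viseme, weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.5):
    """Pick the viseme with the largest weighted score; None means rejected."""
    totals = {name: sum(w * s for w, s in zip(weights, scores))
              for name, scores in scores_by_viseme.items()}
    best = max(totals, key=totals.get)
    return (best if totals[best] >= threshold else None), totals


# Toy example with made-up scores (intensity, edges, lines) for two templates.
label, totals = classify({"C": (0.9, 0.8, 0.7), "O": (0.2, 0.3, 0.1)})
```

In a full system a `None` result would trigger the repeated search over the whole image before the frame is finally rejected.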

5 Results 

We tested our method for four speakers. Every speaker read a German text. The 
duration of each sequence is 20 sec at 10 frames per sec (fps). The image size 
was 352 x 288, and the output frame rate was about 5 fps on a Pentium III 700 MHz 
using a desktop camera. The system was trained using representative phonemes 
of the five visemes. The normalized recognition rate is shown in the tables. The 
location of the lips was incorrect for only a few frames, less than 1% of the total 
number of frames. 

The quality of a lip tracking algorithm is indicative of how well it performs 
in the recognition of visemes or words. Table 1 shows the percentage of 
correctly estimated visemes for each sequence. For sequence 2, Table 2 shows 
the recognition of each viseme separately in detail. 



Table 1. Ratio of correctly classified visemes using a single feature and combined features 

Sequence           1      2      3      4
Only Intensity    55 %   64 %   40 %   39 %
Only Edges        46 %   43 %   33 %   30 %
Only Lines        21 %   31 %   25 %   21 %
Altogether        61 %   73 %   59 %   43 %



Table 2. Viseme recognition for sequence 2. The visemes in the horizontal direction are 
the expected ones (top row); in the vertical direction they are the estimated ones (left column) 

Visemes     C      O      R      L      H
C          55 %    0 %    0 %    0 %    0 %
O           5 %   84 %    0 %    7 %   35 %
R          15 %    0 %   84 %    4 %    4 %
L          15 %   12 %    5 %   60 %   22 %
H          10 %    4 %   16 %   30 %   37 %



For some of the speakers, it was difficult to distinguish between very close 
visemes, e.g. L and H, or C and L. This was the case in, e.g., sequence 3, 
where the L viseme was difficult to detect. 

The speaker in sequence 4 has a beard and moustache. This reduced the 
recognition rate; the viseme H was always classified as R or L. Table 3 shows the 
recognition rate after reducing the image size to 75%. This simulates the case 
when the speaker is moving away from the camera. The recognition rate was 
reduced because too few lip features remained. 



Table 3. Recognition rate for 75% zoomed image. 

Sequence            1      2      3      4
Correct visemes    41 %   49 %   34 %   29 %

6 Conclusion and Future Work 

We have presented a method for lip tracking and recognition of the five main 
visemes in German. The method can, however, also be applied to other languages. The 
visemes were divided into five groups according to the shape of the lips in natural 
speech. Three features of the lips were used for the recognition, which made 
the location of the lips and the viseme recognition robust. 

The method has been tested for different speakers and has achieved a good 
viseme recognition rate of about 60%. The frame rate was 5 fps. 

Currently, visemes are extracted from only one image at a time. As speech 
is determined by the absolute shape and the transition between the visemes, this 
will be taken into account in our future work, as will the combination with a 
speech recognition system. 
Abstract. This paper is an extended abstract for a keynote speech at 
DAGM 2001. The talk aims at describing different Augmented Reality 
(AR) applications and some of the challenges they bring for 
researchers in computer vision, computer graphics and mobile computing. 
Different applications developed at Siemens Corporate Research (SCR)^1 
are used to illustrate both the advantages and the shortcomings of existing 
AR systems^2. 



Augmented Reality (AR) is a relatively new field. Before the 1990s the idea of 
real-time augmentation of one’s view of the world had come up mostly in terms 
of wishful and futuristic thoughts and/or science fiction. In various movies, 
imaginary characters were able to see location-based information, messages or 
other information superimposed onto their view of the real world. The scientific 
community had to wait until the early 1990s to see some of the first implementations 
of such ideas. In the following years, the more Virtual Reality (VR) failed to 
fulfill its promise of changing the way people did their tasks, the more 
AR was considered a serious alternative. AR started to define itself as the 
solution to the problems that prevented VR from becoming as popular as expected. 
Virtual Reality failed to satisfy its users for many reasons, including the high 
cost of: a) building the virtual model of the world, b) keeping the model up 
to date, and c) training the users to work in such an unfamiliar and non-intuitive 
environment. VR required its industrial users to create detailed models of the 
environment and to radically change their workflow. By definition, VR aims at 
integrating or immersing the user into the virtual world. In contrast, AR aims 
at integrating virtual elements into the real world of its users. This seems more 
attractive, since it does not require the user to build the whole world in order to 
be able to work. The working environment does not need to change dramatically. 

^1 Other than the author, the following researchers also work on Augmented Reality 
at or in collaboration with SCR: Mirko Appel, Benedicte Bascle, Yakup Genc, Ali 
Khamene, Erin McGarrity, Matthias Mitschke, Frank Sauer, Mihran Tuceryan and 
Xiang Zhang. 

^2 This extended abstract is not intended to be a review paper on augmented reality 
literature. The references provide the readers only with publications of Siemens 
Corporate Research on or related to Augmented Reality. 

The virtual elements can be added incrementally in order to give the user time to 
adapt to the new workflow. In addition, the user can often continue to work in a 
modified environment without explicitly modeling the modifications. However, 
Augmented Reality poses new challenges and needs dramatic progress in multiple 
research fields before achieving its ultimate goals. 

Augmented Reality attracts researchers from many scientific and technical 
communities. Apart from the software design and software engineering aspects of 
an AR system, which I decided not to discuss in this short presentation, 
real-time pose estimation, rendering and tracking are three major aspects of any 
AR system. Therefore, researchers with a computer vision or computer graphics 
background form the majority of the AR scientific community. In order to augment 
the user’s view of the surrounding world one can choose one of the following four 
options: 

— project the virtual components onto the real world, 

— project the virtual components onto user’s retina, 

— observe the real world through a semi-transparent display visualizing the 
virtual components, 

— observe the world through video cameras and integrate the virtual compo- 
nents into the camera images. 

The first two options are quite interesting but less common than the last 
two. The first option has been well studied and experimented with, in particular 
at the University of North Carolina, Chapel Hill. The second option is provided 
by the MicroVision Virtual Retinal Scan Display (VRD). This system projects a 
low-energy laser beam onto the user’s retina. The third and fourth options are 
available in different formats and are used in many academic and industrial 
research laboratories. 

Computer vision can play an important role in AR applications. Its primary 
role is probably in automatic real-time detection, tracking and pose estimation. 
A tracking camera can be attached to all the different types of augmentation 
devices described in the previous paragraph. For the fourth option, i.e. the 
camera see-through devices, the same camera can sometimes be used for tracking 
and pose estimation as well as view augmentation. 
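
To make the role of pose estimation concrete, the following minimal Python 
sketch shows how, once a camera pose is available, a point of a virtual 
component can be mapped into the camera image for camera see-through 
augmentation. It assumes an ideal pinhole camera without lens distortion. The 
function name and the parameters (rotation, translation, fx, fy, cx, cy) are 
illustrative assumptions and are not taken from any of the systems described 
here. 

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point into pixel coordinates (ideal pinhole model).

    rotation is a 3x3 matrix (nested lists) and translation a 3-vector that
    map world coordinates into the camera frame; both are assumed to come
    from some tracker. fx, fy, cx, cy are assumed camera intrinsics.
    Returns (u, v), or None if the point is not in front of the camera.
    """
    # Transform the point from the world frame into the camera frame.
    cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # Points at or behind the image plane cannot be drawn.
    if cam[2] <= 0:
        return None
    # Perspective division followed by the intrinsic scaling and offset.
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return (u, v)
```

Calling project_point for every vertex of a virtual component gives the pixel 
positions at which it has to be drawn over the video frame. Any error in the 
estimated pose appears directly as misalignment in the image. 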

The Augmented Reality Group at Siemens Corporate Research was formed 
by a collection of researchers all coming from a computer vision background. 
Industrial as-built reconstruction for updating CAD data and for revamp and 
maintenance planning was the first motivation for our image augmentation 
activities. Fig. [H shows some of the results of CyliCon, our off-line 
calibration, reconstruction and augmentation software. Once industrial 
customers are exposed to such results, their first question is “could we 
have these augmentations in real-time?”. This has a lot of applications. For 
example, power plants need to be inspected both on a regular basis and after 
each plant modification and update. The inspectors need to get to the right 
components in large plants and access the necessary data. This brings new 
challenges for both computer vision and mobile/wearable computing. In this talk 
I will discuss these challenges in detail. I use our vision-based real-time 
localization, navigation and augmented reality data access ISE], see Fig. |3 for 
illustration. This system uses the real-time marker-based pose estimation and 
image augmentation, which we had previously developed for electronic sales 
support |24l, see Fig. 13. The system runs on different mobile and wearable 
computers, all equipped with optical cameras. I will talk about the advantages 
and shortcomings of the currently available mobile and wearable computers in 
regard to their use for mobile AR applications. 

The majority of optical and video see-through augmented reality systems 
that use cameras for tracking and localization rely on specially designed 
markers. In spite of many years of research on self-calibration and structure 
from motion in both photogrammetry and computer vision, none of the existing AR 
systems can rely on this technology for robust and consistent feature-based pose 
estimation in real environments and in real-time. In this talk, I will describe 
this problem and the challenges it brings to the computer vision community in 
more detail. Examples of the computer vision algorithms that are currently used 
in AR systems will be given, and their shortcomings and some possible solutions 
will be discussed. 

For calibration of optical see-through systems, one of the challenges is to 
recover the projection geometry of each user’s eye. In these systems, as well 
as in VRD systems, there is no camera and the image is formed only on the 
user’s retina. In order to calibrate these systems one needs to model and recover 
the projection geometry of the user’s eyes. We developed the Single Point Active 
Alignment Method (SPAAM) |2I| to calibrate optical see-through HMDs (this 
method is also applicable to the stereo case |2| and can be made more robust 
when a vision-based tracker is used ^). The first version of our optical 
see-through system used magnetic trackers to estimate the user’s head position. 
Our latest version uses optical and infra-red tracking for this purpose (see 
Fig. 4 for details of our latest see-through systems). Since these systems 
augment the user’s view of the world and not images, evaluation of such systems 
also poses new challenges. We describe a first proposal for evaluation of such 
AR systems in |2I7I6| . 
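
As an illustration of what recovering the projection geometry can mean in 
practice, the sketch below estimates a 3x4 projection matrix from 
correspondences between 3D points (for instance, a single point expressed in 
tracker coordinates for different head positions) and the 2D display positions 
that the user aligned with it. It uses the standard Direct Linear Transform and 
NumPy. It is a sketch under these assumptions and not the SPAAM implementation 
itself; all function names are hypothetical. 

```python
import numpy as np


def estimate_projection_dlt(points_3d, points_2d):
    """Estimate a 3x4 projection matrix (up to scale) with the standard DLT.

    points_3d: N x 3 coordinates, points_2d: N x 2 display coordinates,
    with N >= 6 correspondences in general (non-coplanar) position.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    if len(points_3d) != len(points_2d) or len(points_3d) < 6:
        raise ValueError("need at least 6 matching 3D/2D correspondences")

    rows = []
    zeros = np.zeros(4)
    for point, (u, v) in zip(points_3d, points_2d):
        xh = np.append(point, 1.0)  # homogeneous 3D point
        # Each correspondence gives two linear equations in the 12 unknowns.
        rows.append(np.concatenate([xh, zeros, -u * xh]))
        rows.append(np.concatenate([zeros, xh, -v * xh]))

    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 4)


def project(projection, points_3d):
    """Apply a 3x4 projection matrix to N x 3 points and return N x 2 pixels."""
    points_3d = np.asarray(points_3d, dtype=float)
    homogeneous = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    projected = homogeneous @ projection.T
    return projected[:, :2] / projected[:, 2:3]
```

With noisy user alignments, more than six correspondences are normally 
collected, and the reprojection error computed with project gives a simple 
check of the calibration quality. 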

The most precise and reliable augmentation of video see-through HMD 
systems, to our knowledge, has been developed for medical applications. One of 
these systems was designed and developed at Siemens Corporate Research imnm- 
This system allows the physician to see CT, MR or Ultrasound images overlaid 
on the patient during the operation, see Fig. 5. This provides the physician 
with a whole new way of visualizing medical images and relating them to the 
patient during the surgery. The physician can move his/her head freely and 
observe the augmentation with no jitter. However, this impressive system cannot 
be used in practice unless computer vision and computer graphics can meet the 
challenges this application poses. The main problem is the correct perception 
of virtual objects behind or inside real objects. The first part of the problem 
is simple occlusion. For example, the physician should not see the medical image 
overlaid on his/her hands or instruments during the surgery. This is a difficult 
challenge for computer vision research. There is a need for precise and 
real-time detection of the physician’s hands and instruments. Note that both 
hands and instruments change shape and color during the surgery. The second part 
of the problem concerns the correct perception of virtual objects inside real 
objects. This poses even more difficult challenges for researchers in computer 
vision and in particular in computer graphics. How should we render virtual 
medical data such as CT, MR or Ultrasound on top of optical views of the 
patient’s anatomy such that the viewer perceives them at the right depth? This 
is a difficult problem, and it becomes much harder as the physician cuts the 
surface and opens the patient during invasive surgery. These are beautiful open 
problems that need to be addressed by the computer vision and computer graphics 
communities H. 
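
The simple occlusion part of the problem can be sketched as follows, assuming 
that a per-pixel mask of the physician’s hands and instruments is already 
available. Obtaining that mask reliably and in real time is the open problem 
described above. The grayscale list-of-lists image representation, the function 
name and the blending weight alpha are assumptions made for illustration only. 

```python
def composite_with_occlusion(video, overlay, occluder_mask, alpha=0.5):
    """Blend a rendered overlay into a video frame, except where occluded.

    video:         rows of grayscale pixel values (the camera image)
    overlay:       rows of grayscale values, or None where nothing is rendered
    occluder_mask: rows of booleans, True where a hand or instrument is seen
    alpha:         assumed blending weight of the overlay (0 = invisible)
    """
    result = []
    for video_row, overlay_row, mask_row in zip(video, overlay, occluder_mask):
        row = []
        for pixel, virtual, occluded in zip(video_row, overlay_row, mask_row):
            if occluded or virtual is None:
                # Real foreground objects stay visible: keep the video pixel.
                row.append(pixel)
            else:
                # Elsewhere, mix the virtual data into the video image.
                row.append(round((1 - alpha) * pixel + alpha * virtual))
        result.append(row)
    return result
```

This only suppresses the overlay on masked pixels. It does not address the 
second part of the problem, the correct depth perception of virtual objects 
inside real ones. 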

Fig. 4. The various components of our see-through system. A Sony XC-55BB camera 
with an infra-red filter is attached to one of the head-mounted displays (the 
video see-through display from Mixed Reality Labs and the two optical 
see-through devices: the Virtual Retinal Scan Display (VRD) from Microvision 
and the I-glasses). The attached camera observes a set of retro-reflective 
markers for tracking. The camera is surrounded by a set of infra-red LEDs to 
illuminate the scene for optimal real-time tracking. 

Fig. 5. Left image: video view of head phantom augmented with skull contour lines 
and the model of a tumor, segmented from a set of magnetic resonance (MR) images. 
Two of the MR slices are also shown. Right image: video see-through head-mounted 
display with additional tracker camera. 




Fig. 6. The top images illustrate the CAMC concept for real-time alignment of X-ray 
and video images as well as our first prototype with the camera attached to the 
housing of the X-ray source and the double mirror system. The bottom images show 
three examples of real-time merger of X-ray and video images. The precise 
alignment of visible and invisible lines on the left image demonstrates the 
accuracy of the overlay, while the other two images show the clear advantage of 
this new visualization for image-guided surgery. 